Interpretable Deep Learning in Drug Discovery

by   Kristina Preuer, et al.

Without any means of interpretation, neural networks that predict molecular properties and bioactivities are merely black boxes. We will unravel these black boxes and will demonstrate approaches to understand the learned representations which are hidden inside these models. We show how single neurons can be interpreted as classifiers which determine the presence or absence of pharmacophore- or toxicophore-like structures, thereby generating new insights and relevant knowledge for chemistry, pharmacology and biochemistry. We further discuss how these novel pharmacophores/toxicophores can be determined from the network by identifying the most relevant components of a compound for the prediction of the network. Additionally, we propose a method which can be used to extract new pharmacophores from a model and will show that these extracted structures are consistent with literature findings. We envision that having access to such interpretable knowledge is a crucial aid in the development and design of new pharmaceutically active molecules, and helps to investigate and understand failures and successes of current methods.


1 Introduction

(To be published in "Interpretable AI: Interpreting, Explaining and Visualizing Deep Learning".)

The central goal of drug discovery research is to identify molecules that act beneficially on the human (or animal) system, e.g., that have a certain therapeutic effect against particular diseases. It is generally unknown what chemical structures have to look like in order to induce the wanted biological effects. Therefore, a large number of molecules have to be investigated to find a potential drug, leading to long drug identification times and high costs. This is typically done by means of High-Throughput Screening (HTS), where a biological screening experiment is used to identify whether a molecule at a given concentration exhibits a certain biological effect or not. However, running a large number of these experiments is expensive and time-intensive. Therefore, using computational models as a means of “Virtual Screening”, i.e. predicting these biological effects using computational methods and thereby avoiding physical screening, has a long tradition in drug development [17, 15].

In the past years, the advent of deep learning has allowed neural networks to become the best-performing method for predicting biological activities based on the chemical structure of the molecules  [19, 21] mostly because of their ability to exploit the multi-task setting [31]. Recently, deep learning enabled automated molecule generation [28, 22, 35], which has become a new interesting application in the field of drug design. However, some generative models still have problems with mode collapse [33] and are hard to evaluate [25]. Interpretability of neural networks both for predictions and automated drug design could further push their performance, would increase their usability and would especially improve acceptance.

Nowadays, two types of deep neural networks are most frequently used in Virtual Screening: descriptor-based feed-forward neural networks (see Section 3) and graph convolutional neural networks (see Section 4). Descriptor-based neural networks rely on predefined features, so-called molecular descriptors, whereas graph convolutional neural networks learn a continuous representation directly from the molecular graph. Neural networks take these discrete or numerical representations of a chemical molecule and calculate their prediction by feeding that representation through several layers of non-linear, differentiable transformations with many, often millions of, adjustable parameters. Unfortunately, the function that is encoded in such a neural network is typically impossible to interpret by humans. In other words: how the neural network reaches a conclusion is usually beyond the understanding of a human user. This work aims at bridging this gap in our understanding of neural network predictions for drug discovery. Although [9, 2, 27] have already focused on the difficult question of how machine learning models can be interpreted, none of these works focuses on an in-depth analysis of both descriptor-based and graph-based deep learning models for QSAR predictions.

In this work, we first show how a trained neural network can be used to interpret which parts of a molecule are important for its biological properties, and then demonstrate how graph convolutional neural networks can be used to extract annotated chemical substructures, known as pharmacophores or toxicophores. We will empirically show that neural networks rely on pharmacophore-like features to reach their conclusions, similar to how a human pharmacologist would. Concretely, we will show in subsection 3.1 that the units that form the layers of a neural network are pharmacophore detectors. Furthermore, we will demonstrate in subsection 3.2 how indicative substructures can be determined for individual samples. In the second part of our analysis we will focus on graph convolutional neural networks and show that the identified pharmacophores extracted directly from the network match well-known, annotated substructures from the literature.

Figure 1: Example of a 2 dimensional structure representation of a molecule.

2 Learning from molecular structures

There are multiple scales at which molecules such as the example shown in Fig. 1 can be represented. Molecules can be represented by their molecular formula (1D), by their structural formula (2D), by their conformation (3D), by their mutual orientation and time-dependent dynamics or combinations of all these [5]. The choice of the right representation is task dependent and crucial for the learning algorithm. Most commonly, molecules are described by so-called Extended Connectivity Fingerprints (ECFPs) [26]. These 2D-descriptors represent the 2-dimensional structure as a bit vector indicating the presence and absence of predefined substructures and showed a high predictive performance in [32, 20, 24, 18]. Therefore, we will use these as descriptors for our first experiments described in section 3.
However, newer approaches focus on direct end-to-end methods, where the molecular representation is directly learned from the molecular graph [7, 8, 13]. These graph convolutional methods learn molecular representations during the training process and are therefore able to learn wildcards and flexible substructures. We will analyse the representations learned by a graph convolutional network in section 4.
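To make the idea of a substructure bit vector concrete, the following is a minimal sketch of a fingerprint in the spirit of ECFPs. It is not the ECFP algorithm of [26]: it ignores bond orders, charges and duplicate removal, and the function name `toy_fingerprint`, the 64-bit length and the CRC32 hash are assumptions chosen only for illustration.

```python
import zlib

def toy_fingerprint(atoms, bonds, radius=1, n_bits=64):
    """Hash every atom neighbourhood up to `radius` bonds into a bit vector."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Radius 0: an atom is identified by its element symbol alone.
    ids = {i: atoms[i] for i in neighbours}
    bits = [0] * n_bits
    for _ in range(radius + 1):
        for ident in ids.values():
            # A deterministic hash maps each substructure identifier to one bit.
            bits[zlib.crc32(ident.encode()) % n_bits] = 1
        # Grow every identifier by one bond: own id plus sorted neighbour ids.
        ids = {
            i: ids[i] + "(" + ",".join(sorted(ids[j] for j in neighbours[i])) + ")"
            for i in neighbours
        }
    return bits

# Heavy-atom graphs of ethanol (C-C-O) and ethane (C-C).
ethanol = toy_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethane = toy_fingerprint(["C", "C"], [(0, 1)])
```

Because neighbour identifiers are sorted before hashing, the result does not depend on the order in which atoms are listed. With radius 1, every substructure of ethane also occurs in ethanol, so every bit set for ethane is set for ethanol as well.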

3 Descriptor-based Feed Forward Neural Networks

A feed forward neural network consists of several layers of computing units. Each layer takes as its input the vector of outputs of the layer below it. In the first layer, we use the input data instead. The layer transforms its input according to some parameterized function to produce its own output

$$h^{(l)} = f\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \tag{1}$$

where $h^{(0)} = x$ is the input data and $f$ is an activation function that is applied to each element, or “hidden unit”, of the vector individually. Each of these elements can be understood as a feature detector, which detects the presence or absence of some feature in its inputs. The nature of that feature is defined by the learned parameters $W^{(l)}$ and $b^{(l)}$, but is usually very difficult to interpret, as the features are typically a highly abstract, non-linear function of the input features. However, we can show that when learned on typical drug development tasks, these hidden units encode features that are very similar to features used by pharmaceutical researchers for decades.
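As an illustration of a single layer, the sketch below computes the output of one fully connected layer with the SELU activation, which is the activation used in the experiments of this section. The helper names `selu` and `dense_layer` and the tiny weight matrix are assumptions made for this example; the constants are the standard SELU values from [14].

```python
import math

# Standard SELU constants (Klambauer et al. [14]).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """Scaled exponential linear unit applied to one pre-activation."""
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def dense_layer(x, weights, biases):
    """One layer: every hidden unit is an activation of a weighted input sum."""
    return [
        selu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]

# A 3-bit input (e.g. three fingerprint bits) fed into two hidden units.
x = [1.0, 0.0, 1.0]
hidden = dense_layer(x, [[1.0, 1.0, 1.0], [-1.0, 0.0, -1.0]], [0.0, 0.0])
```

Each entry of `hidden` plays the role of one feature detector: the first unit responds positively to the two bits that are present, while the second is pushed into the negative saturating branch of SELU.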

3.1 Interpreting Hidden Neurons

A common way to analyze chemical properties of small molecules is by looking at their structures. Atoms that are close together often form functional groups, which may have specific roles for binding to the respective biological targets. These functional groups form larger structures which then are responsible for the biological effect, by modulating a biological target (e.g. a protein or DNA). They work together to build reactive centers that steer chemical reactions. Binding to a specific target can only take place when the necessary active centers are present at exactly the right locations. The exact configuration of active centers is referred to as a pharmacophore [16]. In other words, a pharmacophore is a molecular substructure, or a set of molecular substructures, that is responsible for a specific interaction between chemical molecules and biological targets. It is our hypothesis that the hidden units of a neural network learn to detect pharmacophores. To investigate this, we employed a strategy similar to [3], and trained a network that predicts the toxicology of molecules, using the data set from the Tox21 Data Challenge [11]. The data set contains around 12 000 molecules, for each of which twelve biological effects were measured in wet lab experiments with binary outcome (“toxic”, “non-toxic”). These twelve biological effects pose a multi-task classification problem. Deep Learning is the best performing method for this task [20, 14], and we follow the network architecture outlined in [14] for our experiments, using a network with 4 hidden layers of 1024 hidden units with SELU activation function. To represent our molecules, we use ECFPs [26] of radius 1, meaning that the input representation includes presence/absence calls of single atoms and small substructures with at most 5 atoms, but gives no concrete information about larger molecular substructures. The network still performed relatively well, with an average AUC over the 12 targets of , which would still place it among the top 10 models in the original Tox21 Data Challenge [20].

After training the network, we calculate the activation of the hidden units for all molecules in the training set, and relate them to presence/absence calls of pharmacophores calculated for the same molecules. For this, we used pharmacophores known to be relevant in toxicology [30], the so-called toxicophores. Starting from all the toxicophores in [30], we filter out those that were present in fewer than 20 of our molecules, leaving us with a total of about 650 toxicophores. For each hidden unit and each toxicophore, we then performed a Mann–Whitney U-Test to see if there was a significant difference in the activations between the molecules where a given toxicophore was present and the ones where it was absent. We then looked at correlations that were significant at after adjusting for multiple testing using a very conservative Bonferroni correction. This leaves us with a total of 290 toxicophores that were significantly correlated with hidden units of the network.
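The test for one pair of hidden unit and toxicophore can be sketched as follows. This is a minimal standard-library version of a two-sided Mann–Whitney U-Test that uses the normal approximation without tie correction, which is a simplification. The function names, the synthetic activations and the significance level `alpha=0.05` are assumptions for illustration only and are not taken from the experiments.

```python
import math

def mann_whitney_u(present, absent):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in present] + [(v, 1) for v in absent])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0  # average rank for tied values
        i = j + 1
    n1, n2 = len(present), len(absent)
    rank_sum = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    p = math.erfc(abs(u - mean) / std / math.sqrt(2.0))
    return u, p

def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that stay below alpha after dividing by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Synthetic activations of one hidden unit: molecules with / without a toxicophore.
active = [1.0 + 0.01 * i for i in range(20)]
inactive = [0.01 * i for i in range(20)]
u_stat, p_value = mann_whitney_u(active, inactive)
flags = bonferroni_significant([p_value, 0.03, 0.5])
```

In the setting described above, the list passed to `bonferroni_significant` would contain one p-value per combination of hidden unit and toxicophore, so the corrected threshold becomes very small.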

Next, we investigated whether the hierarchy of the layers is associated with the complexity of the detected toxicophores. If a network learns some biologically meaningful information in one of its layers, it will still need to transport this information through all other layers to use it in its final prediction at the top layer. This means that every important toxicophore which is discovered in a lower layer will usually also reappear in all subsequent layers. Figure 2 shows which layers are discovering our known pharmacophores. It appears that the pharmacophores are primarily discovered in the first few layers. The later layers tend to mainly discover pharmacophores that are more complex. Here, we measure the complexity of a pharmacophore by the number of atoms involved in it. The results are well in line with the usual view of deep learning constructing more and more complex features in its higher layers [4].

We have demonstrated that neural networks learn pharmacophore detectors by correlating the hidden units with known toxicophores. However, not all samples contain a known toxicophore. Hence, we will demonstrate in the next section how indicative substructures can be identified for any input molecule.

Figure 2: Left: How many pharmacophores were found in each layer of the network. Right: Average size of the pharmacophores (in number of involved atoms) that are first discovered in a given layer. Error bars are standard errors. Note that there are no error bars for layer number 8, as only one pharmacophore was first discovered here.

3.2 Interpreting the Importance of Input Components

Several ideas have been proposed to explain the predictions from a neural network by attributing its decisions to specific input features. See Ancona et al. [1] for a short overview of gradient-based methods. One of these methods is Integrated Gradients [29], which calculates an attribution value $\mathrm{IG}_i(x)$ for each input dimension $i$. The value can be interpreted as the contribution of $x_i$ to changing the prediction from a baseline input $x'$ to some specific input $x$. It fulfills many useful properties, for example, it is guaranteed that $\sum_i \mathrm{IG}_i(x) = F(x) - F(x')$, where $F$ denotes the model's output. Additionally, of the methods considered by [1], Integrated Gradients is the only one that works well even when there are multiplicative interactions between features. Furthermore, it is independent of the concrete choice of architecture, activation function or other hyperparameters. The method works by aggregating the gradients of the model’s output on the straight path between $x'$ and the target $x$. When implementing the method, this integral is approximated by a sum:

$$\mathrm{IG}_i(x) \approx (x_i - x'_i)\,\frac{1}{m}\sum_{k=1}^{m} \frac{\partial F(\tilde{x}^{(k)})}{\partial x_i}, \qquad \tilde{x}^{(k)} = x' + \frac{k}{m}\,(x - x'), \tag{2}$$

where $\tilde{x}^{(k)}$ describes the interpolation path and $m$ is the number of steps in the approximation that controls how exact the results will be. In our experiments, we obtained good results using an $m$ of . We used a zero vector as baseline for the feed-forward network, which represents a molecule in which all substructures are set to absent.
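The summation can be sketched in a few lines. The example below applies the Riemann-sum approximation of Integrated Gradients to a small logistic model whose gradient is known in closed form; the model, its weights, the function names and the default of `m=50` steps are assumptions for illustration and do not correspond to the networks or settings used in the experiments.

```python
import math

def model(x, w, b):
    """Toy differentiable model: logistic regression on the input bits."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    s = model(x, w, b)
    return [s * (1.0 - s) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, m=50):
    """Average the gradient along the straight path from baseline to x."""
    total = [0.0] * len(x)
    for k in range(1, m + 1):
        point = [xb + (k / m) * (xi - xb) for xi, xb in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the distance from the baseline.
    return [(xi - xb) * t / m for xi, xb, t in zip(x, baseline, total)]

x = [1.0, 0.0, 1.0, 1.0]          # fingerprint bits of one molecule
baseline = [0.0, 0.0, 0.0, 0.0]   # all substructures absent
w, b = [2.0, -1.0, 0.5, -0.3], -0.5
attributions = integrated_gradients(x, baseline, w, b)
```

With a zero baseline, a bit that is absent in the molecule receives an attribution of zero, and the attributions sum approximately to the difference between the prediction for the molecule and the prediction for the baseline. Increasing `m` tightens this approximation.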

Alcohol Toy Data Set.

As a proof of concept, we investigate if Integrated Gradients can be used to extract interpretable pharmacophores from a feedforward neural network. For this purpose we have constructed a toy data set which classifies compounds based on a simple rule: compounds containing an alcohol group (i.e. a hydroxy group bound to a saturated carbon) are classified as positive, whereas compounds containing no hydroxy groups, as well as carboxylic acids, are classified as negative. The data set consists of 28 147 samples including 1 236 positives. The negative samples comprise 26 047 oxygen-free molecules and 864 carboxylic acids. The simplest rule which can be learned is that compounds without hydroxyl groups are classified as negative. In a second step a rule has to be found which discriminates between different hydroxyl groups.
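The labelling rule can be sketched on a simplified molecular graph. The representation used here (heavy atoms with an implicit hydrogen count, bonds with an integer order) and the function name `label_compound` are assumptions for illustration; the sketch covers only the three compound types of the toy data set and does not handle aromatic systems.

```python
def label_compound(atoms, bonds):
    """Return 1 if a hydroxy group sits on a saturated carbon, else 0.

    atoms: list of (element, number_of_hydrogens)
    bonds: list of (atom_index, atom_index, bond_order)
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        adjacency[i].append((j, order))
        adjacency[j].append((i, order))

    for i, (element, n_hydrogens) in enumerate(atoms):
        if element != "O" or n_hydrogens == 0:
            continue  # not a hydroxy oxygen
        for j, _ in adjacency[i]:
            if atoms[j][0] != "C":
                continue
            # A saturated carbon has only single bonds; the C=O of a
            # carboxylic acid fails this check.
            if all(order == 1 for _, order in adjacency[j]):
                return 1
    return 0

ethanol = ([("C", 3), ("C", 2), ("O", 1)], [(0, 1, 1), (1, 2, 1)])
acetic_acid = (
    [("C", 3), ("C", 0), ("O", 0), ("O", 1)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)
ethane = ([("C", 3), ("C", 3)], [(0, 1, 1)])
labels = [label_compound(*m) for m in (ethanol, acetic_acid, ethane)]
```

The carboxylic acid contains a hydroxy group but is still labelled negative, which is exactly the case that forces a model to look beyond the oxygen atom itself.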

In this experiment, we used a fully connected network based on ECFPs of radius 1. Radius 1 is sufficient for this task, because a hydroxyl group bound to a saturated carbon can be distinguished from a carboxylic acid group. In this experiment the model consisted of 4 layers of 1 024 units with SELU activation and achieved a test set AUC of . To investigate what was important for the predictions, we used Integrated Gradients. The feed-forward network is based on ECFPs, hence Integrated Gradients provides attributions for each fingerprint. Each fingerprint consists of multiple atoms and one atom is part of multiple fingerprints. Hence, we calculated the atom-wise attribution as the sum of the attributions of all fingerprints of which this atom is part.
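The summarization from fingerprint-wise to atom-wise attributions amounts to a simple accumulation. The sketch below assumes that, for each fingerprint bit, the set of atoms covered by the corresponding substructure is known; the bit indices, attribution values and the name `atom_attributions` are made up for illustration.

```python
def atom_attributions(fingerprint_attr, fingerprint_atoms, n_atoms):
    """Sum the attribution of every fingerprint over the atoms it covers."""
    scores = [0.0] * n_atoms
    for bit, attribution in fingerprint_attr.items():
        for atom in fingerprint_atoms.get(bit, ()):
            scores[atom] += attribution
    return scores

# Ethanol with atoms 0 = C, 1 = C, 2 = O and three active fingerprint bits.
fingerprint_attr = {7: 0.6, 3: -0.2, 11: 0.3}          # bit -> attribution
fingerprint_atoms = {7: [1, 2], 3: [0, 1], 11: [2]}    # bit -> covered atoms
scores = atom_attributions(fingerprint_attr, fingerprint_atoms, n_atoms=3)
```

Because substructures overlap, an atom next to an important group inherits part of its attribution, which is why neighboring atoms can appear slightly positive in the visualizations.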

Figure 3 shows the attributions for five randomly selected molecules for each of the three different molecule types. In the top, middle and bottom row negative samples without a hydroxyl group, negative samples with a carboxylic acid group and positive samples are displayed, respectively. In the first row, almost all atoms obtain a negative attribution, which is reasonable since none of these atoms are part of a hydroxyl group. Only in a small fraction (1.9%) of the tested atoms were small positive attributions observed. In the second row molecules with carboxylic acids are shown. Atoms not belonging to the acid group are in general classified as negative, whereas the hydroxyl group obtains positive attributions. This means that the network is still able to identify that the hydroxyl group is important for a positive classification. In the third row molecules with hydroxyl groups are displayed. This group was identified as positively contributing to the prediction in 0.83% of the atoms by the network and Integrated Gradients. Due to the overlapping fingerprints, neighboring atoms also obtain a slightly positive attribution, whereas atoms further away are still clearly identified as negative contributions. This toy example has shown that Integrated Gradients can be used to extract the rules underlying the classification.

After this toy experiment, in which we knew which rules had to be applied to classify the samples, we will investigate whether the decisions for a more complex task are still interpretable and reasonable. For this purpose, we used the Tox21 data set.

Tox21 Challenge data set.

In this experiment, we investigated whether Integrated Gradients can be used to extract chemical substructures which are important for classification into toxic and non-toxic chemical compounds on the largest available toxicity data set.

Figure 3: Attributions assigned to the atoms by the model for the three types of compounds. 5 randomly chosen negative samples without hydroxyl groups, negative samples with carboxylic acid groups and positive samples are shown in the top, middle and bottom row, respectively. Dark red indicates that these atoms are responsible for a positive classification, whereas dark blue atoms contribute to a negative classification.

Figure 4: Illustration of atom-wise attribution for 12 randomly drawn positive Tox21 samples. Attributions were extracted from a model trained on ECFP_2 fingerprints. The network clearly bases its decision on larger structures, which were built inside the model out of the small input features.

For this purpose, we trained a fully connected neural network consisting of 4 SELU layers with 2048 hidden units each 1024 ECFPs with radius 1 on the Tox21 data set. This network achieved a mean test AUC of . We followed the same procedure described above which consists of two major steps: applying Integrated Gradients and summarizing feature-wise attributions into atom-wise attributions. Figure 4 shows 12 randomly drawn, correctly classified positive test samples. It can be observed that positive attributions cluster together and form substructures. Please note that the model was trained only on small substructures, hence the formation of the larger pharmacophores is a direct result of the learning process. Together with the fact that some attributions are negative or close to zero, this indicates that the neural network is able to focus on certain atoms and substructures thereby differentiating between the indicative and not relevant parts of the input. The substructures on which the network bases its decision can be viewed as a pharmacophore-like substructure that indicates toxicity, a so-called “toxicophore”.

Up to now, we have only considered feed-forward neural networks. In the following sections, we will focus on the second prominent type of network used for virtual screening: graph convolutional neural networks. We will show how these networks can be used to extract annotated substructures, such as pharmacophores and toxicophores, rather than focusing on interpreting individual samples. This knowledge can be helpful for understanding the basic mechanisms of biological activities.

4 Graph Convolutional Neural Networks

We implemented a new graph convolutional approach which is purely based on Keras [6] (the code is available online). In our approach, we start, similar to other approaches [7], with an initial atom representation which includes the atom type and the present bond type encoded in a one-hot vector. The network propagates this representation through several graph convolutional layers, where each layer performs two steps. In the first step, neighboring atoms are concatenated to form atom pairs. The convolutional filters slide over these atom pairs to obtain a pair representation. The second step is a summarization step, in which a new atom representation is determined. To obtain a new atom representation an order invariant pooling operation is performed over the atom-neighbor pairs. This newly generated representation is then fed into the next convolutional layer. This procedure is illustrated in Fig. 5 a). For training and prediction a summarization step which performs a pooling over the atom representations gives a molecular representation. This step is performed to have a molecular representation with a fixed number of dimensions so that the molecule can be processed by fully connected layers. These steps are shown in Fig. 5 b).
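Formally, this can be described as follows. Let $G$ be a molecular graph with a set of atom nodes $A$. Each atom node $v \in A$ is initially represented by a vector $h_v^{(0)}$ and has a set of neighboring nodes $N_v$. In every layer $l$ the representations $h_v^{(l-1)}$ and $h_w^{(l-1)}$ are concatenated if $w$ is a neighboring node of $v$ to form the pair $p_{vw}^{(l)} = [h_v^{(l-1)}, h_w^{(l-1)}]$. The pair representation $h_{vw}^{(l)}$ is the result of the activation function $f$ of the dot product of the trainable matrix $W^{(l)}$ of layer $l$ and the pair $p_{vw}^{(l)}$. The new atom representation $h_v^{(l)}$ is obtained as a result of any order invariant pooling function $g$ such as max, sum or average pooling.

$$h_{vw}^{(l)} = f\left(W^{(l)} \cdot p_{vw}^{(l)}\right) \tag{3}$$

$$h_v^{(l)} = g\left(\{h_{vw}^{(l)} \mid w \in N_v\}\right) \tag{4}$$

A graph representation $r^{(l)}$ of layer $l$ is obtained through an atom-wise pooling step.

$$r^{(l)} = g\left(\{h_v^{(l)} \mid v \in A\}\right) \tag{5}$$

If skip connections are used, the molecule is represented by a concatenation of all $r^{(l)}$.

$$r = [r^{(1)}, \ldots, r^{(L)}] \tag{6}$$

Otherwise, the representation of the last layer $L$ is used.

$$r = r^{(L)} \tag{7}$$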
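The two steps of a convolutional layer and the subsequent graph pooling can be sketched without any deep learning framework. The code below is not the Keras implementation referred to above; it is a minimal illustration that assumes a ReLU activation, max pooling over atom-neighbor pairs and sum pooling over atoms, and the function names and the tiny weight matrix are made up for this example.

```python
def relu(z):
    return max(0.0, z)

def graph_conv_layer(atom_repr, neighbours, weight):
    """One layer: build atom-neighbor pairs, apply the filters, pool per atom."""
    new_repr = []
    for v, h_v in enumerate(atom_repr):
        pair_reprs = []
        for w in neighbours[v]:
            pair = h_v + atom_repr[w]  # concatenation of both atom vectors
            pair_reprs.append(
                [relu(sum(wij * pj for wij, pj in zip(row, pair))) for row in weight]
            )
        if pair_reprs:
            # Order invariant max pooling over all pairs of this atom.
            new_repr.append([max(column) for column in zip(*pair_reprs)])
        else:
            new_repr.append([0.0] * len(weight))  # isolated atom
    return new_repr

def graph_pool(atom_repr):
    """Atom-wise sum pooling gives a fixed-size molecule representation."""
    return [sum(column) for column in zip(*atom_repr)]

# C-C-O with one-hot atom types [is_carbon, is_oxygen] and two filters.
atoms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbours = {0: [1], 1: [0, 2], 2: [1]}
weight = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
layer_out = graph_conv_layer(atoms, neighbours, weight)
molecule_repr = graph_pool(layer_out)
```

Because the pooling over pairs is order invariant, listing the neighbors of an atom in a different order leaves the output unchanged, and the pooled molecule representation has the same number of dimensions as a single atom representation.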


To ensure that our implementation is on the same performance level as other state-of-the-art graph convolutions, we used the ChEMBL benchmark data set. For the comparisons we used the same data splits as the original publication [21], therefore our results are directly comparable. Our approach achieved an AUC of , which is a 3% improvement compared to the best performing graph convolutional neural network.

Figure 5: a) illustrates the convolutional steps for the blue and the grey center atoms. A new atom representation is formed by convolving over the atom pairs and a subsequent pooling operation. b) shows the graph pooling step which follows the convolutional layers and the fully connected layers at the end of the network. These steps are performed in the training and prediction mode. c) displays the forward pass which is performed to obtain the most relevant substructures. Here, the graph pooling step is omitted and the substructures are directly fed into the fully connected layers.

4.1 Interpreting convolutional filters

In this section, we aim at extracting substructures that induce a particular biological effect. This information is useful for designing new drugs and for identifying mechanistic pathways.
In graph convolutional neural networks the convolutional layers learn to detect indicative structures, and later fully connected layers combine these substructures into a meaningful prediction. Our aim is to extract the indicative substructures which are learned within the convolutional layers. This can be achieved by skipping the atom-wise pooling step of Eq. 5 and propagating the atom representation through the fully connected layers as depicted in Fig. 5 c). Please note that we can skip this step because pooling over the atoms results again in a graph representation which has the same dimensions as the single atom representations. Therefore, we can use the feature representations of individual atoms in the same way as graph representations. Although the predictions are for single atoms, each atom was influenced by its neighborhood, therefore the scores can be understood as substructure scores. The receptive field of each atom increases by one with each convolutional layer, so an atom representation captures substructures of different sizes, depending on how many layers were employed. The substructures are centered at the respective atom and have a radius equal to the number of convolutional layers. Scores close to one indicate that the corresponding substructure is indicative for a positive label, whereas substructures with a score close to zero are associated with the negative label.
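Scoring substructures by bypassing the pooling step can be sketched as follows. For brevity, the fully connected part is reduced to a single sigmoid output unit, which is a simplification of the networks described here; the atom representations, weights and function names are assumptions for illustration.

```python
import math

def dense_head(representation, weights, bias):
    """Stand-in for the fully connected layers: one sigmoid output unit."""
    z = sum(w * r for w, r in zip(weights, representation)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def molecule_score(atom_repr, weights, bias):
    """Training/prediction mode: pool over atoms, then apply the head."""
    pooled = [max(column) for column in zip(*atom_repr)]
    return dense_head(pooled, weights, bias)

def substructure_scores(atom_repr, weights, bias):
    """Interpretation mode: skip the pooling and score every atom directly."""
    return [dense_head(h_v, weights, bias) for h_v in atom_repr]

# Atom representations after the last convolutional layer (made-up values).
atom_repr = [[1.0, 1.0], [2.0, 1.0], [0.0, 2.0]]
weights, bias = [1.5, -1.0], -1.0
scores = substructure_scores(atom_repr, weights, bias)
top_atom = max(range(len(scores)), key=scores.__getitem__)
```

The atom with the highest score marks the center of the most indicative substructure, whose radius is given by the number of convolutional layers that produced `atom_repr`.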

Ames mutagenicity data set

For this experiment we used a well studied mutagenicity data set. This data set was selected because there exist a number of well-known toxic substructures in the literature which we can leverage to assess our approach. The full data set was published by [10], of which we used the subset originally published by [12] as training set and the remaining data as test set. The training set consisted of 4 337 samples comprising 2 401 mutagenic and 1 936 non mutagenic compounds. The test set consisted in total of 3 315 samples containing 1 690 mutagens and 1 625 non mutagens. We trained a network with 3 convolutional layers with 1 024 filters each followed by a hidden fully connected layer consisting of 512 units. The AUC of the model on the test set was 0.804.
To assess which substructures were most important for the network, we propagated a validation set through the network and calculated the scores for each substructure as described above. The most indicative substructures are shown in Figure 6. Each substructure is displayed together with its SMILES representation and its positive predictive value (PPV) on the test set. These extracted substructures coincide very well with previous findings in the literature [12, 23, 34]. In Figure 7 some genotoxic structures found in the literature are displayed with a matching substructure identified by our method. The extracted substructures are known to interact with the DNA. Most of them form covalent bonds with the DNA via uni-molecular and bi-molecular nucleophilic substitutions (SN-1 and SN-2 reactions). Subsequently these modifications can lead to severe DNA damage such as base loss, base substitutions, frameshift mutations and insertions [23].
Within this section we have demonstrated that our method for interpreting graph convolutional neural networks yields toxicophores consistent with literature findings and the mechanistic understanding of DNA damage.
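The positive predictive value of an extracted substructure is the fraction of mutagenic compounds among all test compounds that contain it. A minimal sketch is given below; the function name and the example flags are assumptions, and substructure matching itself (e.g. with a cheminformatics toolkit) is taken as given.

```python
def positive_predictive_value(contains_substructure, is_mutagenic):
    """Return (PPV, number of matching compounds) for one substructure."""
    hits = [label for has, label in zip(contains_substructure, is_mutagenic) if has]
    if not hits:
        return None, 0  # substructure does not occur in the test set
    return sum(hits) / len(hits), len(hits)

# Five test compounds: does each contain the substructure, and is it mutagenic?
contains = [True, True, False, True, False]
mutagenic = [1, 1, 0, 0, 1]
ppv, support = positive_predictive_value(contains, mutagenic)
low_support = support < 5  # such values are marked with an asterisk in Figure 6
```

Reporting the number of matching compounds alongside the PPV matters because a value computed on only a handful of compounds is a weak estimate.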

Figure 6: This figure displays the structures extracted with our approach from the graph convolutional neural network. Below the structures the corresponding SMILES representation is shown together with the positive predictive value (PPV) on the test set. PPVs which were calculated on less than 5 samples are marked with an asterisk.
Figure 7: This figure shows annotated SMARTS patterns identified in the literature as mutagenic together with matching structures identified by our interpretability approach. The names of the structures are shown in the first column, the SMARTS patterns found in literature are displayed in the second column and the last column shows a matching example of the top scoring substructures identified by our method.

5 Discussion

Having a black box model with high predictive performance is often insufficient for drug development: a chemist is not able to derive an actionable hypothesis from just a classification of a molecule into “toxic” or “not toxic”. However, once chemists “see” the structural elements in a molecule responsible for a toxic effect, they immediately have ideas how to modify the molecule to get rid of these structural elements and thus the toxic effect. Therefore, it is an essential goal to gain additional knowledge, and it is necessary to shed light onto the decision process within the neural network and to retrieve the stored information. In this work, we have shown that the layers of a neural network construct toxicophores and that larger substructures are constructed in the higher layers. Furthermore, we have demonstrated that Integrated Gradients is an adequate method to determine the indicative substructures in a given molecule. Additionally, we propose a method to identify the learned toxicophores within a trained network and demonstrated that these extracted substructures are consistent with the literature and chemical mechanisms.


References

  • [1] Ancona, M., Ceolini, E., Öztireli, C., Gross, M.: Towards better understanding of gradient-based attribution methods for deep neural networks. International Conference on Learning Representations (ICLR) (2018)
  • [2] Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.R.: How to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (Aug 2010)
  • [3] Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6541–6549 (2017)

  • [4] Bengio, Y.: Deep learning of representations: Looking forward. In: Proceedings of the First International Conference on Statistical Language and Speech Processing. pp. 1–37. SLSP’13, Springer-Verlag, Berlin, Heidelberg (2013)
  • [5] Cherkasov, A., Muratov, E.N., Fourches, D., Varnek, A., Baskin, I.I., Cronin, M., Dearden, J., Gramatica, P., Martin, Y.C., Todeschini, R., Consonni, V., Kuz’min, V.E., Cramer, R., Benigni, R., Yang, C., Rathman, J., Terfloth, L., Gasteiger, J., Richard, A., Tropsha, A.: QSAR modeling: Where have you been? where are you going to? Journal of Medicinal Chemistry 57(12), 4977–5010 (jan 2014)
  • [6] Chollet, F.: Keras. (2015)
  • [7] Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., Adams, R.P.: Convolutional networks on graphs for learning molecular fingerprints. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 2224–2232. Curran Associates, Inc. (2015)
  • [8] Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 1263–1272. PMLR, International Convention Centre, Sydney, Australia (06–11 Aug 2017)
  • [9] Hansen, K., Baehrens, D., Schroeter, T., Rupp, M., Müller, K.R.: Visual interpretation of kernel-based prediction models. Molecular Informatics 30(9), 817–826 (sep 2011)
  • [10] Hansen, K., Mika, S., Schroeter, T., Sutter, A., ter Laak, A., Steger-Hartmann, T., Heinrich, N., Müller, K.R.: Benchmark data set for in silico prediction of ames mutagenicity. Journal of Chemical Information and Modeling 49(9), 2077–2081 (sep 2009)
  • [11] Huang, R., Sakamuru, S., Martin, M., Reif, D., Judson, R., Houck, K., Casey, W., Hsieh, J., Shockley, K., Ceger, P., et al.: Profiling of the tox21 10k compound library for agonists and antagonists of the estrogen receptor alpha signaling pathway. Scientific reports 4 (2014)
  • [12] Kazius, J., McGuire, R., Bursi, R.: Derivation and validation of toxicophores for mutagenicity prediction. Journal of Medicinal Chemistry 48(1), 312–320 (2005)
  • [13] Kearnes, S., McCloskey, K., Berndl, M., Pande, V., Riley, P.: Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design 30(8), 595–608 (aug 2016)
  • [14] Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. Advances in Neural Information Processing Systems 30 (NIPS) (2017)
  • [15] Lavecchia, A.: Machine-learning approaches in drug discovery: methods and applications. Drug Discovery Today 20(3), 318–331 (2015)
  • [16] Lin, S.: Pharmacophore perception, development and use in drug design. Edited by Osman F. Güner. Molecules 5(7), 987–989 (2000)
  • [17] Lionta, E., Spyrou, G., Vassilatis, D.K., Cournia, Z.: Structure-based virtual screening for drug discovery: Principles, applications and recent advances. Current Topics in Medicinal Chemistry 14(16), 1923–1938 (2014)
  • [18] Lounkine, E., Keiser, M.J., Whitebread, S., Mikhailov, D., Hamon, J., Jenkins, J.L., Lavan, P., Weber, E., Doak, A.K., Cote, S., Shoichet, B.K., Urban, L.: Large-scale prediction and testing of drug activity on side-effect targets. Nature 486(7403), 361–367 (Jun 2012)
  • [19] Ma, J., Sheridan, R.P., Liaw, A., Dahl, G.E., Svetnik, V.: Deep neural nets as a method for quantitative structure–activity relationships. Journal of Chemical Information and Modeling 55(2), 263–274 (2015)
  • [20] Mayr, A., Klambauer, G., Unterthiner, T., Hochreiter, S.: DeepTox: Toxicity prediction using deep learning. Frontiers in Environmental Science 3, 80 (2016)
  • [21] Mayr, A., Klambauer, G., Unterthiner, T., Steijaert, M., Wegner, J.K., Ceulemans, H., Clevert, D.A., Hochreiter, S.: Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chemical Science 9(24), 5441–5451 (2018)
  • [22] Olivecrona, M., Blaschke, T., Engkvist, O., Chen, H.: Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics 9(1), 48 (2017)
  • [23] Plošnik, A., Vračko, M., Dolenc, M.S.: Mutagenic and carcinogenic structural alerts and their mechanisms of action. Archives of Industrial Hygiene and Toxicology 67(3), 169–182 (sep 2016)
  • [24] Preuer, K., Lewis, R.P.I., Hochreiter, S., Bender, A., Bulusu, K.C., Klambauer, G.: DeepSynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics 34(9), 1538–1546 (dec 2017)
  • [25] Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., Klambauer, G.: Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling 58(9), 1736–1741 (aug 2018)
  • [26] Rogers, D., Hahn, M.: Extended-connectivity fingerprints. Journal of Chemical Information and Modeling 50(5), 742–754 (May 2010)
  • [27] Schütt, K.T., Arbabzadah, F., Chmiela, S., Müller, K.R., Tkatchenko, A.: Quantum-chemical insights from deep tensor neural networks. Nature Communications 8, 13890 (jan 2017)
  • [28] Segler, M.H., Kogej, T., Tyrchan, C., Waller, M.P.: Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science (2017)
  • [29] Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. Proceedings of the 34th International Conference on Machine Learning (ICML) (2017)
  • [30] Sushko, I., Salmina, E., Potemkin, V.A., Poda, G., Tetko, I.V.: Toxalerts: A web server of structural alerts for toxic chemicals and compounds with potential adverse reactions. Journal of Chemical Information and Modeling 52(8), 2310–2316 (2012)
  • [31] Unterthiner, T., Mayr, A., Klambauer, G., Steijaert, M., Wegner, J.K., Ceulemans, H., Hochreiter, S.: Multi-task deep networks for drug target prediction. In: Workshop on Transfer and Multi-task Learning of NIPS2014. vol. 2014, pp. 1–4 (2014)
  • [32] Unterthiner, T., Mayr, A., Klambauer, G., Steijaert, M., Wegner, J.K., Ceulemans, H., Hochreiter, S.: Deep learning as an opportunity in virtual screening. Deep Learning and Representation Learning Workshop (NIPS 2014) (2014)
  • [33] Unterthiner, T., Nessler, B., Klambauer, G., Heusel, M., Ramsauer, H., Hochreiter, S.: Coulomb GANs: Provably optimal Nash equilibria via potential fields. International Conference on Learning Representations (ICLR) (2018)
  • [34] Yang, H., Li, J., Wu, Z., Li, W., Liu, G., Tang, Y.: Evaluation of different methods for identification of structural alerts using chemical Ames mutagenicity data set as a benchmark. Chemical Research in Toxicology 30(6), 1355–1364 (may 2017)
  • [35] Yang, X., Zhang, J., Yoshizoe, K., Terayama, K., Tsuda, K.: ChemTS: an efficient python library for de novo molecular generation. Science and Technology of Advanced Materials 18(1), 972–976 (2017)