I Introduction
Automated design of molecules with desired properties is an emerging field with important applications including drug discovery [6] and material design [9]. Despite the complex structure of many molecules, in particular organic molecules, specific properties, e.g., toxicity or solubility in water, may be caused by substructures within a molecule. These substructures are called functional groups (FGs); e.g., toxophores are a specific type of FG’s that produce toxicity in toxin molecules [21]. The study of common FGs is central to organic chemistry [3], see, e.g., the hydroxyl FG (), as they can be used to systematically predict behavior of a compound in chemical reactions. Identifying of FGs can hint how to design molecules that posses particular properties.
The experimental approach to FG discovery requires highprecision instruments and many hours of expert supervision. In modern applications such as drug discovery, estimated to have a space of
to molecules [19], experimentally searching the space is infeasible. Fortunately, recent efforts in combining computational chemistry and machine learning (ML) leverage data driven methods suggests possibility of efficiently narrowing this search space
[5, 9, 27]. This stems from success of machine learning in areas such as computer vision and natural language processing.
Recent success in computer vision is due to emergence of deep convolutional neural networks (CNNs) [16] that has led to state of the art performance on classification [10] and object detection [20]
tasks. The success of CNNs may be attributed to convolutional layers, which reduce the number of learnable parameters and allow deeper networks for multilevel abstraction of feature extraction. In addition, these features may be used to localize signal to regions of the input, giving a means of interpreting decisions by the network
[29]. ’Despite this success, traditional deep CNNs are designed for Euclidean space where data is defined on a structured grid, e.g. domain of an image, as convolution is an operation defined on Euclidean space. For this reason, CNNs cannot be used directly on domains with different data structures such as graph structured data.A recent variant of CNNs designed for graph structured data are graph convolution neural networks (GCNNs) [5, 8]
, which extend the definition of convolution to nonEuclidean spaces. GCNNs inherit ideas like shared weights and deep hierarchical feature distillation from CNNs and have led to promising results in classifying graphstructured data
[14]. Building upon the success of CNNs in computer vision, a recent line of research has applied GCNNs to atomic and molecular applications [5, 11, 23, 27, 9]. Inheriting properties of CNNs, GCNNs have led to promising results. In these methods, molecules are modeled as graphs, where the graph nodes represent the atoms, and the graph edges (potentially weighted) represent the chemical bonds and their types, and learning is performed on the moleculelevel.Here, we investigate, for the first time, GCNN methods that determine localized parts of a graph corresponding to a specific classification, as inspired by related work on images [29]. We compare four different methods: gradientbased saliency maps, Class Activation Mapping (CAM), Gradientweighted Class Activation Mapping (GradCAM), and Excitation BackPropagation (EBP). We evaluate their performance with respect to contrastiveness (classspecific localization) and sparsity of localization and qualitatively compare the localization heatmaps over the graph structure. For the graph input, we focus on molecules. So, highlighted structural components can show FGs that correspond to a certain chemical property.
Ii Related Work
Various ML techniques have been recently applied to applications within molecular level chemistry and biochemistry. Given a proper feature extraction method that can convert all molecules in a dataset to fixedsize vectors, we can use ML techniques to automatically predict chemical characteristics of molecules
[4]. But finding the right feature extraction method is difficult and can be restrictive. Recent ML progress is mainly based on the emergence of deep neural architectures including CNNs, which automatize the process of feature extraction in an endtoend and data driven scheme. The challenge of applying CNNs on molecule datasets is that CNNs can only receive structured data as their input, e.g., images. This limitation has been circumvented by the invention of graph convolutional neural networks (GCNN). GCNNs provide an extension of CNNs to nonEuclidean spaces and are suitable for handling graph structured data such as molecules. Similar to CNNs, GCNNs are able to learn descriptive features automatically that outperform engineered features, enabling GCNNs to achieve stateoftheart performance on several chemical prediction tasks, including toxicity prediction [11], solubility [5], and energy prediction [23].A long standing limitation of general deep neural networks has been the difficulty in interpreting and explaining the classification results. Recently, explainability methods have been devised for deep networks and specifically CNNs [25, 29, 24, 28]. These methods enable one to probe a CNN and identify the important substructures of the input data (as deemed by the network), which could be used as an explanatory tool or as a tool to discover unknown underlying substructures in the data. For example, in the area of medical imaging, in addition to classifying images having malignant lesions, they can be localized, as the CNN can provide reasoning for classifying an input image. Here, we are interested in measuring the potential of these methods for discovery of FGs in organic molecules.
The most straightforward approach for generating a sensitivity map over the input data to discover the importance of the underlying substructures is to calculate a gradient map within a layer by considering the norm of the gradient vector with respect to an input for each network weight [25]. However, gradient maps are known to be noisy and smoothening these maps might be necessary [26]. More advanced techniques include Class Activation Mapping (CAM) [29], Gradientweighted Class Activation Mapping (GradCAM) [24], and Excitation BackPropagation (EB) [28] techniques that improve gradient maps by taking into account some notion of context. These techniques have been shown to be effective on CNNs and can identify highly abstract notions in images.
Inspired by the explainability power of deep CNNs, our goal is to adapt these techniques for deep GCNNs to automatize discovery of chemical FGs for a particular behavior. This process can be particularly helpful for FGs because, as opposed to images, humans cannot intuitively determine the relevant context within a molecule for a particular property of that molecule. Our specific contributions in this work are:

Adapting explanation tools for CNNs to GCNNs with the application of discovering FGs for organic molecules,

Comparing the contrastive power and class specificity of the explainability methods in identifying FGs, and

Analyzing three molecular datasets and their identified FGs.
We envision that our proposed framework could help chemists with identifying new FGs that have not been discovered before, reducing the experimental cost and the required time needed for this purpose.
Iii Methods
We compare and contrast the application of four popular explainability methods to Graph Convolutional Neural Networks (GCNNs). These methods are gradientbased saliency maps [25], Class Activation Mapping (CAM) [29], Gradientweighted Class Activation Mapping (GradCAM) [24], and Excitation BackPropagation (EBP) [28]. Furthermore, we explore the benefits of a number of enhancements to these approaches.
Iiia Explainability for Convolutional Neural Networks
Perhaps the most straightforward (and wellestablished) approach is that of gradientbased saliency maps [25]. In this approach, one simply differentiates the output of the model with respect to the model input, thus creating a heatmap, through the norm of the gradient over input variables, indicating their relative importance. Note that the resulting gradient in the input space points in the direction corresponding to the maximum positive rate of change in the model output. Therefore the negative values in the gradient are discarded to only retain the parts of input that positively contribute to the solution:
(1) 
where is the score for class
before the softmax layer, and
is the input. While easy to compute and interpret, saliency maps generally perform worse than newer techniques (like CAM, GradCAM, and EB), and it was recently argued that saliency maps tend to represent noise rather than signal [13].The CAM approach provides an improvement over saliency maps for convolutional neural networks, including GCNNs, by identifying important, classspecific features at the last convolutional layer as opposed to the input space. It is wellknown that such features tend to be more semantically meaningful (e.g., faces instead of edges). The downside of CAM is that it requires the layer immediately before the softmax classifier (output layer) to be a convolutional layer followed by a global average pooling (GAP) layer. This precludes the use of more complex, heterogeneous networks, such as those that incorporate several fully connected layers before the softmax layer.
To compute CAM, let be the feature map of the convolutional layer preceding the softmax layer. Denote the global average pool (GAP) of by
(2) 
where . Then, a given class score, , can be defined as
(3) 
where the weights are learned based on the inputoutput behavior of the network. The weight encodes the importance of feature for predicting class . By upscaling each feature map to the size of the input images (to undo the effect of pooling layers) the classspecific heatmap in the pixelspace becomes
(4) 
The GradCAM method improves upon CAM by relaxing the architectural restriction that the penultimate layer must be a convolutional. This is achieved by using feature map weights that are based on backpropagated gradients. Specifically, GradCAM defines the weights according to
(5) 
Following the intuition behind Equation (4) for CAM, the heatmap in the pixelspace according to GradCAM is computed as
(6) 
where the ReLU function ensures that only features that have a positive influence on the class prediction are nonzero.
Excitation BackPropagation is an intuitively simple, but empirically effective explanation method. In [22], it is argued and demonstrated experimentally that explainability approaches such as EB [28], which ignore nonlinearities in the backwardpass through the network, are able to generate heatmaps that “conserve” evidence for or against a network predicting any particular class. Let
be the i’th neuron in layer
of a neural network and be a neuron in layer . Define the relative influence of neuron on the activation of neuron , where and for being the synaptic weights between layers and, as a probability distribution
over neurons in layer . This probability distribution can be factored as(7) 
Zhang et al. then define the conditional probability as
(8) 
where
is a normalization factor such that . For a given input (e.g., an image), EB generates a heatmap in the pixelspace w.r.t. class by starting with at the output layer and applying Equation (7) recursively.
These reviewed explainability methods are originally designed for CNNs, which are defined on a signal supported on a uniform grid. Here, we are interested in signals supported on nonEuclidean structures, e.g. graphs. In what follows, we first briefly discuss GCNNs and then describe the extensions of these explainability methods to GCNNs.
IiiB Graph Convolutional Neural Networks
Let an attributed graph with nodes be defined with its node attributes and its adjacency matrix (weighted or binary). In addition, let the degree matrix for this graph be . Following the work of Kipf and Welling [15], we define the graph convolutional layer to be
(9) 
where is the convolutional activations at the layer, , is the adjacency matrix with added self connections where
is the identity matrix,
, are the trainable convolutional weights, andis the elementwise nonlinear activation function. Figure
1 shows the used GCNN architecture in this work, where the activations in layers follow Eq. (9), which is a firstorder approximation of localized spectral filters on graphs.For molecule classification, each molecule can be represented as an attributed graph , where the node features summarize the local chemical environment of the atoms in the molecule, including atomtypes, hybridization types, and valence structures [27], and the adjacency matrix encodes atomic bonds and demonstrate the connectivity of the whole molecule (see Figure 1). For a given dataset of labeled molecules with labels indicating a certain chemical property, e.g., bloodbrainbarrier penetrability or toxicity, the task is to learn a classifier that maps each molecule to its corresponding label, .
Given that our task is to classify individual graphs (i.e., molecules) with potentially different number of nodes, we use several layers of graph convolutional layers followed by a global average pooling (GAP) layer over the graph nodes (i.e. atoms). In this case, all graphs will be represented with a fixed size vector. Finally, the GAP features are fed to a classifier. To enable applicability of CAM [29], we simply used a softmax classifier after the GAP layer.
IiiC Explainability for Graph Convolutional Neural Networks
In this subsection, we describe the extension of CNN explainability methods to GCNNs. Let the ’th graph convolutional feature map at layer be defined as:
(10) 
where denotes the column of matrix , and (see Eq. (8)). In this notation, for the ’th atom of the molecule, the ’th feature at the ’th layer is denoted by . Then, the GAP feature after the final convolutional layer, , is calculated as
(11) 
and the class score is calculated as, . Using this notation, the explainability methods could be extended to GCNNs as follows:
Gradientbased atomic heatmaps for the ’th atom are calculated as
(12) 
CAM atomic heatmaps for the ’th atom are calculated as
(13) 
GradCAM’s class specific weights for class at layer and for the ’th feature are calculated by
(14) 
and the heatmap for the ’th atom calculated from layer is
(15) 
GradCAM enables us to generate heatmaps with respect to different layers of the network. In addition, for our model shown in Figure 1, GradCAM’s heatmap at the final convolutional layer and CAM’s heatmap are equivalent (See [24] for more details). In this work, we report results for as well as
(16) 
Excitation Backpropagation
’s heatmap for our model is calculated via backward passes through the softmax classifier, the GAP layer, and several graph convolutional layers. The equations for backward passes through the softmax classifier and the GAP layer are(17) 
where for the class of interest and zero otherwise. The backward passes through the graph convolutional layers, however, are more complicated. For notational simplicity, we decompose a graph convolutional operator into
(18) 
where the first equation is a local averaging of atoms (with
), and the second equation is a fixed perceptron applied to each atom (analogous to onebyone convolutions in CNNs). The corresponding backward passes for these two functions can be defined as
(19) 
We generate the heatmap over the input layer by recursively backpropagating through the network and averaging the backpropagated probability heatmaps on the input:
(20) 
The contrastive extension of follows Eq. (8) in [28]; we call this contrastive variant, cEB.
Iv Experiments
This section describes the experimental setup, results of classspecific explanations, and a substructure frequency analysis identifying relevant FGs.
Iva Experimental Setup
We evaluated explanation methods on three binary classification molecular datasets, BBBP, BACE, and task NRER from TOX21 [27]. Each dataset contains binary classifications of small organic molecules as determined by experiment. The BBBP dataset contains measurements on whether a molecule permeates the human blood brain barrier and is of significant interest to drug design. The BACE dataset contains measurements on whether a molecule inhibits the human enzyme secretase. The TOX21 dataset contains measurements of molecules for several toxicity targets. We selected the NRER task from this data, which is concerned with activation of the estrogen receptor [18]. These datasets are imbalanced. Class ratios for each dataset are reported in Table I.
In addition, we followed the recommendations in [27] for train/validation/test partitioning, where random partitioning is not recommended due to domainspecific considerations. In particular, for BACE and BBBP, a ”scaffold” split is recommended, which partitions molecules according to their structure, i.e. structurally similar molecules are partitioned in the same split. As this is recommended in existing literature, we follow this convention.
Using 80:10:10 train/validation/test split, we report AUCROC and AUCPR values of our trained model for each dataset in Table II. These results are comparable to those reported in [27], and confirm that the model was trained correctly.
Positives  Negatives  

BBBP  1560  479 
BACE  691  821 
TOX21  793  5399 
AUCROC  AUCPR  

Train  Validation  Test  Train  Validation  Test  
BBBP  0.991 0.001  0.991 0.003  0.966 0.007  0.994 0.003  0.992 0.007  0.966 0.012 
BACE  0.991 0.001  0.986 0.007  0.999 0.001  0.939 0.043  0.924 0.036  0.999 0.001 
TOX21  0.868 0.006  0.873 0.009  0.881 0.008  0.352 0.047  0.348 0.057  0.347 0.048 
For all datasets, we used the GCNN + GAP architecture as described in Figure 1
with the following configuration: three graph convolutional layers of size 128, 256, and 512, respectively, followed by a GAP layer, and a softmax classifier. Models were trained for 100 epochs using the ADAM optimizer with learning rate
, ,. The models were implemented in Keras with Tensorflow backend
[2].IvB ClassSpecific Explanations
After training models for each dataset, we computed each explanation method on all samples and generated heapmaps showing relevant substructures (Figure 3 shows selected results). The heatmaps are calculated for positive and negative classes and normalized for each molecule across both classes and nodes to form a probability distribution over the input nodes. Class specificity can be seen by comparing explanations across classes within a method, i.e., nodes activated by one class tend to be inactivate for the other.
To quantitatively measure class specificity, we propose a measure based on the hamming distance between binarized heatmaps for positive and negative classes (Figure
2). We report the ratio of the hamming distance of the binarized heatmaps to the total number of identified atoms by heatmaps in Table III. GradCAM and cEB showed the most contrastive power. In addition, to account for varying number of atoms and sparsity of identified substructures in different molecules, we include a measure of sparsity of the activation (Table IV). We define this measure as the number of identified atoms in either explanation, divided by the total number of atoms. The contrastive Excitation BackPropagation method showed the sparsest activations.IvC Substructure Frequency Analysis
Analyzing a collection of molecules with a given property for common substructures is a known technique for discovering relevant functional groups [1, 17]. As the number of all possible substructures in a set of molecules is huge, these methods typically restrict analysis to a set of substructures obtained from a fragmentation algorithm. Here, we close the loop for discovering FGs by identifying functional molecular substructures from the machine learned generated heatmaps.
GCNN explanations (i.e., heatmaps) often occur on colocated regions in a molecule and provide a data driven means of generating candidate substructures. Further, we can analyze generated heatmaps for reoccurring patterns which may yield fruitful insights or help select candidate functional substructures. We devised an automated method for counting the occurrence of substructures identified from the heatmaps generated by the explanation methods. In short, we counted the frequency of each substructure observed in explanations for the three dataset and further analyzed their class specificity.
To identify each substructure, we took the largest connected components consisting of atoms with explanation value greater than some threshold (here, 0), which we call activated atoms, and edges between such atoms. After extracting the activated connected components as identified by the heatmaps, we count their frequency. This analysis requires comparing molecular substructures, a functionality found in open source computational chemistry libraries such as RDKit. We restricted our method to consider only exact substructure comparisons.
A prevalent substructure in the dataset may artificially show a high prevalence in the generated heatmaps. To account for this potential imbalance, we counted the occurrences of explanationidentified substructures in both positive and negative labeled data in the dataset. We used these counts to normalize the counts obtained from the explanations and construct two ratios:
and
where , and are the number of times a substructure occurs in explanations, the positively labeled data, and the negatively labeled data respectively. The ratio measures how often a substructure occurs in explanations. The second one measures how often a substructure occurs in positively labeled data, and serves as a baseline for the first. Note that a high corresponds to high class specificity for an identified substructure.
These ratios are sensitive to rare substructures. For instance if a substructure occurs only once and occurs in the explanations then it has . To mitigate this sensitivity, we report only substructures that occur more than times in the dataset.
Figure 4 shows the most prominent substructures according to our analysis. Here, due to space limitation, we used only GradCAM to extract the explanations because it was the most contrastive method (Table III). Additionally, we restricted the explanations to those samples which were true positives. We ranked substructures by and report the top 10 for each dataset. As shown in the Figure, the identified substructures have high class specificity. In addition, we observed a few patterns of known FGs being discovered by our method: Halogens (Cl,F,Br) are prevalent in explanations for BBBP, Amides are prevalent in explanations for BACE, and aromatic ring structures are prevalent for TOX21.
BBBP  BACE  TOX21  
CG  0.45 2.19  0.77 2.99  0.2 2.13 
GradCAM  99.99 0.11  100.0 0.0  99.99 0.29 
GradCAM Avg  41.0619.05  29.2214.04  44.0323.7 
EB  50.8718.76  60.2915.40  49.0622.59 
cEB  96.97 5.68  97.045.12  97.239.3 
BBBP  BACE  TOX21  
CG  99.78 2.43  99.721.58  99.792.98 
GradCAM  93.747.83  90.647.67  95.148.74 
GradCAM Avg  99.990.07  100.00.0  99.990.11 
EB  59.6522.11  48.613.97  69.8823.04 
cEB  59.4621.69  46.9913.95  69.6922.91 
V Conclusion
In this work, we extended explainability methods, which were designed for CNNs, to GCNNs, compared these methods qualitatively and quantitatively, and used them as a tool for discovering functional groups in organic molecules.
We compared four methods for identifying molecular substructures that are responsible for a given classification. The GradCAM methods could qualitatively identify functional groups that are known to be chemically active for the specified tasks, e.g., CF3 for brain bloodbarrier penetration [7, 12]. For other identified functional groups, additional experiments are needed to confirm their chemical properties.
With our metrics, GradCAM and cEB showed the best contrastive power for showing substructures for different classes. Compared to GradCAM, cEB showed sparser activations, which could be an advantage in certain applications. For the benzene group, however, cEB could not capture the entire group because the activation was too sparse. Here, GradCAM performed better. So, apparently, there is an optimal value for the sparsity, which may depend on the application.
Our results provide a pathway for automated discovery of functional relevant groups of molecules. Such an system is a stepping stone towards fully automated drug discovery. Finally, the proposed framework is a generic tool for discovering functional substructures in general graphs, including social networks and electrical grids.
References
 [1] Y. Chen, F. Cheng, L. Sun, W. Li, G. Liu, and Y. Tang. Computational models to predict endocrinedisrupting chemical binding with androgen or oestrogen receptors. Ecotoxicology and Environmental Safety, 110:280 – 287, 2014.
 [2] F. Chollet. keras. https://github.com/fchollet/keras, 2015.
 [3] F. E. Critchfield. Organic functional group analysis. 1963.
 [4] G. E. Dahl, N. Jaitly, and R. Salakhutdinov. Multitask neural networks for qsar predictions. arXiv preprint arXiv:1406.1231, 2014.
 [5] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. AspuruGuzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
 [6] E. Gawehn, J. A. Hiss, and G. Schneider. Deep learning in drug discovery. Molecular informatics, 35(1):3–14, 2016.
 [7] A. Ghosh, M. Brindisi, and T. J. Developing bsecretase inhibitors for treatment of alzheimer’s disease. Journal of Neurochemistry, 120:71–83, 2012.
 [8] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 2017.
 [9] R. GómezBombarelli, J. N. Wei, D. Duvenaud, J. M. HernándezLobato, B. SánchezLengeling, D. Sheberla, J. AguileraIparraguirre, T. D. Hirzel, R. P. Adams, and A. AspuruGuzik. Automatic chemical design using a datadriven continuous representation of molecules. ACS central science, 4(2):268–276, 2018.
 [10] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
 [11] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computeraided molecular design, 30(8):595–608, 2016.
 [12] S. Kim, H. Lee, and I. Kim. Robust neuroprotective effects of 2((2oxopropanoyl)oxy)4(trifluoromethyl)benzoic acid (optba), a htb/pyruvate ester, in the postischemic rat brain. Scientific Reports, 6, 2016.
 [13] P.J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim. The (un) reliability of saliency methods. arXiv preprint arXiv:1711.00867, 2017.
 [14] T. N. Kipf and M. Welling. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 [15] T. N. Kipf and M. Welling. Semisupervised classification with graph convolutional networks. In Advances in neural information processing systems, 2017.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [17] A. Lombardo, F. Pizzo, E. Benfenati, A. Manganaro, T. Ferrari, and G. Gini. A new in silico classification model for ready biodegradability, based on molecular fragments. Chemosphere, 108:10 – 16, 2014.
 [18] A. Mayr, G. Klambauer, T. Unterthiner, and S. Hochreiter. Deeptox: toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016.
 [19] P. G. Polishchuk, T. I. Madzhidov, and A. Varnek. Estimation of the size of druglike chemical space based on gdb17 data. Journal of computeraided molecular design, 27(8):675–679, 2013.
 [20] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
 [21] J. E. Ridings, M. D. Barratt, R. Cary, C. G. Earnshaw, C. E. Eggington, M. K. Ellis, P. N. Judson, J. J. Langowski, C. A. Marchant, M. P. Payne, et al. Computer prediction of possible toxic action from chemical structure: an update on the derek system. Toxicology, 106(13):267–279, 1996.
 [22] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.R. Müller. Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems, 28(11):2660–2673, 2017.
 [23] K. Schütt, P. Kindermans, H. E. S. Felix, S. Chmiela, A. Tkatchenko, and K. Müller. Schnet: A continuousfilter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pages 991–1001, 2017.
 [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Gradcam: Visual explanations from deep networks via gradientbased localization. In ICCV, pages 618–626, 2017.
 [25] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 [26] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
 [27] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
 [28] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Topdown neural attention by excitation backprop. In European Conference on Computer Vision, pages 543–559. Springer, 2016.

[29]
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba.
Learning deep features for discriminative localization.
InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 2921–2929, 2016.