Recent advances in deep learning (DL) methods [1, 2, 3, 4, 5, 6, 7, 8] have boosted the performance of quantitative structure-activity relationship (QSAR) models in predicting chemical or biological properties of molecules in the drug discovery industry [9, 10]. However, many DL models function as black-boxes: given a molecule with physiochemical/biological and structural features, they usually predict a single global score for a desired property, leaving the inference rationales, such as local judgements on the chemical structure, unknown. Interpreting such rationales is useful for revealing the relationship between structure and property, optimizing compound structure, and validating DL models against expert chemical or biological knowledge [9, 11, 8]. In particular, despite the strong predictive performance of recent Graph Neural Network (GNN) based methods [1, 6, 12, 13, 14, 15, 16], as with conventional models such as random forests and support vector machines (SVM), most of them still function as black-boxes, providing uninterpretable results. In this work, we propose a multi-layer self-attention based Graph Neural Network framework, namely Ligandformer, for predicting compound properties with robust interpretation that reflects the machine's interest in local regions of the input molecule.
2 Related Work
2.1 Molecule Representation
For chemical representation of molecules, current GNN methods [9, 11, 8] usually process a 2D graph description of the natural chemical graph, in which nodes represent atoms carrying different chemical attributes, and edges represent bonds connecting atoms to one another. There are three main advantages of the 2D graph description: (1) the graph preserves clear and stable information about the chemical structure, (2) it represents the molecule invariantly, regardless of the entry position in a line notation (e.g., SMILES), (3) it can be easily computed and optimized by GNN methods. Our chemical formulation takes a similar 2D graph representation, and we adopt a bidirectional graph in which the bond connection from atom A to atom B is the same as the bond connection from atom B to atom A. Moreover, 7 atomic chemical attributes, listed in Figure 1, are used as initial node features of the input graph. Our GNN method learns and aggregates these attributes into molecular features suitable for predicting a given property. For data preprocessing, Ligandformer converts the SMILES sequence into the 2D graph formulation, and the node attributes in Figure 1c can be used to distinguish compounds that share the same molecular structure. Hence, each molecule representation in our method is unique.
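As a minimal illustration of this bidirectional convention (with hypothetical atom indices rather than a real molecule), each undirected bond expands into two directed edges:

```python
def to_bidirectional(bonds):
    """Expand undirected bonds (a, b) into directed edges in both directions."""
    edges = []
    for a, b in bonds:
        edges.append((a, b))
        edges.append((b, a))
    return edges

# Hypothetical 3-atom molecule: atoms 0-1 and 1-2 are bonded.
bonds = [(0, 1), (1, 2)]
print(to_bidirectional(bonds))  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

With this convention, message passing along an edge always has a counterpart in the opposite direction, which is what makes the connection A-to-B equivalent to B-to-A.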
2.2 Self-Attention Mechanism
The self-attention mechanism was first introduced as the key characteristic of Transformers [19, 20, 21, 22, 23, 24, 8]. The mechanism can be considered a graph-like inductive bias that pools features of all tokens in an input sample according to their relevance. Such bias can be interpreted as the machine's focus on each element of the input sample.
The core operation of the mechanism is the single-head self-attention mechanism, in which each element learns to aggregate over all elements of an input according to their relevance. A single-head self-attention mechanism is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \quad (1)$$

where $Q = XW_Q$, $K = XW_K$ and $V = XW_V$ are linear projections applied to the hidden features $X$ of the input, $W_Q$, $W_K$ and $W_V$ are the weight matrices for the query, key and value transformations, and $d$ is the dimension of the hidden input features $X$. In equation (1), the attention matrix $A = \mathrm{softmax}(QK^\top/\sqrt{d})$ is responsible for learning relevance scores between any two elements of the input $X$, which is why the process is called self-attention. Figure 1b shows the working mechanism of a single-head self-attention layer. For multi-head self-attention, the operations on $(Q, K, V)$ are conducted $h$ times in parallel, yielding a series of outputs $Z_1, \dots, Z_h$. The final output is $Z = \mathrm{concat}(Z_1, \dots, Z_h)\,W_O$, where $W_O$ is the parameter matrix for the output linear projection.
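To make equation (1) concrete, here is a minimal NumPy sketch of a single-head self-attention layer; the matrix shapes and random inputs are illustrative, not the paper's actual dimensions:

```python
import numpy as np

def single_head_attention(X, W_q, W_k, W_v):
    """Equation (1): softmax(Q K^T / sqrt(d)) V with Q = X W_q, K = X W_k, V = X W_v."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # relevance between every element pair
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)         # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                   # 4 elements (e.g., atoms), 8 hidden features
W = [rng.standard_normal((8, 8)) for _ in range(3)]
Z, A = single_head_attention(X, *W)
print(Z.shape, A.shape)  # (4, 8) (4, 4)
```

Each row of `A` holds one element's learned relevance scores over all elements, which is the quantity Ligandformer later visualizes as attention coefficients.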
3 Method and Experiments
In this section, we describe our interpretable deep learning framework, namely Ligandformer, and demonstrate its performance in predicting compounds' chemical/biological properties. For a given molecule, we utilize the Python RDKit toolkit to parse the molecule's SMILES string as the input data format, and convert the data into a bidirectional graph following the processing pipelines of DeepChem and Chemprop [26, 27]. The graph mainly consists of index lists of nodes and edges. For each node, each attribute in Figure 1c is converted into a number, and all 7 numeric attributes are combined into an initial node feature vector. The node feature is then passed to the node embedding layer of our framework, which learns an optimal attribute combination and produces an aligned node feature vector.
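A toy sketch of this attribute-to-vector conversion follows; the attribute names and element encoding below are illustrative placeholders, since the actual 7 attributes are defined in Figure 1c of the paper:

```python
# Illustrative only: the real attribute set comes from Figure 1c of the paper.
ATTRIBUTES = ["element", "degree", "formal_charge", "num_hydrogens",
              "aromatic", "in_ring", "chirality"]
ELEMENT_IDS = {"C": 0, "N": 1, "O": 2, "S": 3}  # hypothetical element encoding

def node_feature(atom):
    """Convert each attribute to a number and combine all 7 into one feature vector."""
    return [float(ELEMENT_IDS[atom[name]]) if name == "element" else float(atom[name])
            for name in ATTRIBUTES]

atom = {"element": "N", "degree": 2, "formal_charge": 0, "num_hydrogens": 1,
        "aromatic": True, "in_ring": True, "chirality": 0}
print(node_feature(atom))  # [1.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0]
```

The resulting raw vector is what the node embedding layer would then map to an aligned feature vector.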
3.1 Self-attention based Graph Neural Networks Architecture
Similar to other Transformers, each block of our graph neural network framework is characterized by a graph neural network (GNN) module, a single-head self-attention mechanism, layer normalization layers and residual passages. However, Ligandformer forms a wide, parallel architecture, shown in Figure 1a, in which the hidden features from all previous blocks are concatenated and fed to the self-attention layer of the next block. The idea behind this dense connection is to enhance message passing across the different self-attention layers and thus improve the robustness of the attention mechanism. Unlike SAMPN, another recent interpretable GNN method, which inserts a self-attention mechanism between the MPN backbone and the classifier/regressor, Ligandformer integrates the self-attention mechanism into each computational block and merges attention information from different layers into the final visible interpretation. We describe the Ligandformer architecture in detail and demonstrate its strengths in the remainder of this paper.
For graph feature aggregation and message passing, we modified the structure of GIN and used it as our GNN module. The module is a variant of spatial graph neural networks [29, 30, 12] that aggregates features by taking both the summation and the element-wise maxima of neighbouring features, which enhances message propagation from shallow blocks to deep blocks. Specifically, for each node $v$ in graph $G$, its node features $h_v^{(k)}$ from the GNN module in the $k$-th block are calculated as:

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\!\left(\sum_{u \in \mathcal{N}(v)} \hat{h}_u^{(k-1)} + \max_{u \in \mathcal{N}(v)} \hat{h}_u^{(k-1)}\right) \quad (2)$$

where $\mathrm{MLP}^{(k)}$ is the MLP function, $\hat{h}_u^{(k-1)}$ concatenates the features of node $u$ from the $(k-1)$-th block and the $(k-2)$-th block, and $u$ denotes a neighbouring node in node $v$'s neighbourhood $\mathcal{N}(v)$. The output node features $h_v^{(k)}$ are concatenated with the node features from all previous blocks, i.e., $\tilde{h}_v^{(k)} = \mathrm{concat}\big(h_v^{(1)}, \dots, h_v^{(k)}\big)$, and then fed to a single-head self-attention layer using equation (1); the final node features $z_v^{(k)}$ from the $k$-th block are computed as:
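As a rough sketch of the sum-plus-max neighbourhood aggregation described above (omitting the MLP and the concatenation with previous blocks, and using tiny hand-made features rather than real molecular data):

```python
import numpy as np

def sum_max_aggregate(H, neighbors):
    """Aggregate each node's neighbourhood by summation plus element-wise maxima,
    matching the sum-and-max aggregation described for the GNN module.
    H: (num_nodes, dim) node features; neighbors: list of neighbour-index lists."""
    out = np.zeros_like(H)
    for v, nbrs in enumerate(neighbors):
        N = H[nbrs]                       # features of v's neighbourhood
        out[v] = N.sum(axis=0) + N.max(axis=0)
    return out

H = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
neighbors = [[1], [0, 2], [1]]            # path graph 0-1-2
print(sum_max_aggregate(H, neighbors))
```

In the full module, this aggregated vector would pass through the block's MLP before the self-attention layer.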
$$z_v^{(k)} = \mathrm{LN}\!\left(\tilde{h}_v^{(k)} \oplus \mathrm{Attention}\big(\tilde{h}^{(k)}\big)_v\right) \quad (3)$$

where $\mathrm{LN}$ denotes layer normalization and $\oplus$ is element-wise addition along the feature vector. We then apply a pooling read-out function over all nodes of graph $G$ to obtain its graph representation $g^{(k)}$ from the $k$-th block. The graph representations from all blocks are concatenated and fed to a 3-layer MLP functional head to predict the property probability:

$$p = \mathrm{MLP}\big(\mathrm{concat}(g^{(1)}, \dots, g^{(K)})\big) \quad (4)$$

in which $K$ is the block number.
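The read-out and functional head can be sketched as follows; the exact pooling function is not specified in the text, so mean pooling, the hidden sizes, ReLU activations and the final sigmoid here are assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(block_node_feats, weights):
    """Mean-pool each block's node features into a graph vector, concatenate the
    per-block graph vectors, and apply a 3-layer MLP head ending in a sigmoid."""
    g = np.concatenate([Z.mean(axis=0) for Z in block_node_feats])
    W1, W2, W3 = weights
    return sigmoid(relu(relu(g @ W1) @ W2) @ W3)

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((5, 4)) for _ in range(3)]   # 3 blocks, 5 nodes, dim 4
weights = [rng.standard_normal((12, 8)), rng.standard_normal((8, 8)),
           rng.standard_normal((8, 1))]
p = predict(blocks, weights)
print(p.shape)   # one probability for the molecule
```

The concatenation across blocks is what lets the head see both shallow (local) and deep (fragment-level) graph representations at once.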
For the loss function and optimization, in this work we mainly investigate the binary classification problem, i.e., whether a certain property is positive or negative, and thus we adopt cross-entropy as our loss function, which can be written as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big(y_i \log p_i + (1 - y_i)\log(1 - p_i)\Big) \quad (5)$$

where $N$ is the number of training samples, $y_i$ is the binary label of sample $i$ and $p_i$ is its predicted probability.
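For reference, binary cross-entropy reduces to a few lines of plain Python (the clipping constant `eps` is our own numerical safeguard, not part of the paper):

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean cross-entropy loss for binary labels y in {0, 1} and probabilities p."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 6))  # 0.105361
```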
Our method can also be applied to other kinds of tasks, e.g., regression, multi-class classification, ranking, etc., by modifying the corresponding functional head and loss function. We tried SGD, Adam and AdaBelief to optimize the model parameters, and found that Adam provides the best model, with the highest evaluation performance and the most stable convergence during training and testing.
3.2 Dataset and Experiments
To verify the generalization ability of Ligandformer, we selected three chemical properties from different domains of drug research & development pipelines: 1. aqueous solubility, 2. Caco-2 cell permeability and 3. Ames mutagenesis, covering the ADME/T domains. We trained and tested Ligandformer on publicly disclosed datasets for these three properties.
Aqueous Solubility

Aqueous solubility is the saturated concentration of a chemical in the aqueous phase, usually reported in units of log(mol/L) and written as logS. The dataset was downloaded from the online chemical database and modeling environment (OCHEM), and includes 1,311 assay records. To cast this task as binary classification, we set a cut-off on logS to distinguish soluble from insoluble compounds in the dataset.
Caco-2 Cell Permeability
The Caco-2 cell permeability assay measures the rate of flux of a chemical across polarised Caco-2 cell monolayers, and its results can be used to predict a drug's in vivo absorption. We collected and cleaned 7,624 assayed data points from public and commercial sources, e.g., REAXYS, and set a threshold on the measured permeability to judge positive versus negative absorption.
Ames Mutagenesis

The Ames mutagenesis assay typically uses bacteria to test whether a given chemical can cause mutations in the DNA of the test organism. A positive assay result indicates that the chemical is mutagenic, which implies a high risk of inducing cancer. In our case, 7,617 assayed data points were collected from public sources as our training and testing dataset. We set a 2-fold increase in colonies as the cut-off for binary classification, where reaching the cut-off means positive, otherwise negative.
We trained and tuned each model's parameters under the same default protocol for the three properties respectively. In detail, we took a 90-10 split on each of the three datasets, where 90% of the data was used for training and the remaining 10% for testing. Each dataset was randomly shuffled to ensure that every input sample gives an independent update to the model in each training batch. We also resampled the positive and negative training samples to balance their ratio to 1:1, which avoids skewing training towards the majority class. For the hyperparameter configuration, we used 3 characterized blocks in the architecture and selected the Adam optimizer with tuned learning rate and weight decay. The training batch size was fixed to 256, and we used early stopping with a patience of 50, meaning that training stops after 50 epochs in which the testing performance (i.e., AUROC) does not increase at all. We compared Ligandformer with recently designed counterparts, i.e., MPN and SAMPN, w.r.t. performance in predicting the three chemical properties. As shown in Table 1, Ligandformer consistently outperforms them in terms of AUROC, reaching 0.98, 0.89 and 0.92 respectively. Furthermore, since only a single-head self-attention mechanism is applied in Ligandformer, we believe that a multi-head attention mechanism could further improve its performance in predicting chemical/biological properties.
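A minimal sketch of this data protocol, assuming simple random oversampling of the minority class (the paper does not spell out its exact resampling procedure):

```python
import random

def split_and_balance(samples, labels, test_frac=0.1, seed=0):
    """90-10 shuffled split, then oversample the training minority class so the
    positive/negative ratio becomes 1:1."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)                                  # random shuffle of the dataset
    n_test = int(len(idx) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    balanced = majority + [rng.choice(minority) for _ in range(len(majority))]
    rng.shuffle(balanced)
    return balanced, test_idx

labels = [1] * 20 + [0] * 80                          # a skewed toy dataset
train, test = split_and_balance(list(range(100)), labels)
print(len(test))   # 10 held-out samples; train is balanced to 1:1
```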
Table 1: AUROC on the three property prediction tasks.

| Model | Aqueous Solubility | Caco-2 Cell Permeability | Ames Mutagenesis |
| Ligandformer | 0.98 | 0.89 | 0.92 |
3.3 Attention Map Visualization
As discussed above, while pursuing higher prediction accuracy, robust and efficient interpretation of a QSAR model is also important. By visualizing the attention mechanism, researchers can directly validate and identify the learned features that determine compound property predictions. In Ligandformer, for an input molecule, each block provides an attention score matrix $A$ over all atoms. Since each column of $A$ sums to 1, we define the average value of each row of $A$ as an attention coefficient that reflects the corresponding atom's contribution to the property. While attention coefficients from shallow blocks represent the contributions of atoms or small local parts, attention coefficients from deeper blocks reflect the contributions of fragments or larger-scale parts. We average the attention coefficients over all blocks to obtain an integrated attention map covering these different scopes. Visualizing the integrated attention as a heat map helps chemists/biologists find which parts of a molecule play a more important role in a certain property, and therefore assists researchers in optimizing compound structure accordingly. Figure 2 shows several visualization examples given by Ligandformer and SAMPN respectively.
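The integrated attention map described above can be sketched as follows, with small hand-made attention matrices (columns summing to 1) standing in for the real per-block outputs:

```python
import numpy as np

def attention_coefficients(block_attention_mats):
    """Average each attention matrix row-wise to get per-atom coefficients for one
    block, then average across all blocks to form the integrated attention map."""
    per_block = [A.mean(axis=1) for A in block_attention_mats]  # row-wise averages
    return np.mean(per_block, axis=0)

# Two hypothetical 3-atom attention matrices whose columns each sum to 1.
A1 = np.array([[0.2, 0.5, 0.1], [0.3, 0.3, 0.2], [0.5, 0.2, 0.7]])
A2 = np.array([[0.4, 0.1, 0.3], [0.4, 0.8, 0.3], [0.2, 0.1, 0.4]])
coeffs = attention_coefficients([A1, A2])
print(coeffs.round(3))  # one coefficient per atom: [0.267 0.383 0.35 ]
```

Mapping each coefficient back onto its atom and colouring by value yields the heat-map visualization shown in Figure 2.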
Deep learning models often suffer from prediction instability: even when the overall testing performance stays the same across training rounds, the prediction scores for the same testing sample usually differ, mainly because of randomly initialized model parameters. Ligandformer mitigates such instability by integrating the attention coefficients from all blocks. To demonstrate the robustness of the attention map, we launched two training rounds with different random initial parameters on the same training and testing datasets for aqueous solubility. As shown in Figure 3, even though the attention maps of corresponding blocks differ between the two rounds, the final integrated (averaged) attention maps remain largely the same. This indicates that, given the same training material for a certain property, Ligandformer outputs a robust attention map.
4 Conclusion

In this work, we have introduced Ligandformer, a multi-layer single-head self-attention based Graph Neural Network framework for predicting compound chemical properties with robust interpretation, i.e., an integrated attention map. Facilitated by visualization techniques, the map offers insight into the AI model's rationale for judging which parts of an input molecule impact a certain property prediction, and thus supports researchers in investigating chemical or biological properties and optimizing structures efficiently. Ligandformer outperforms recent GNN counterparts while giving more robust interpretation, and demonstrates good generalization ability, predicting different chemical properties with high accuracy. We hope this interpretable QSAR method can contribute to the scientific community and the drug discovery industry.
References

- Wu et al.  Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
- Feinberg et al.  EN Feinberg, R Sheridan, E Joshi, VS Pande, and AC Cheng. Step change improvement in admet prediction with potentialnet deep featurization. arXiv, 2019.
- Sarkar et al.  Dipanjan Sarkar, Shyamal Sharma, Subhasis Mukhopadhyay, and Asim Kumar Bothra. Qsar studies of fabh inhibitors using graph theoretical & quantum chemical descriptors. Pharmacophore, 7(4), 2016.
- Shao et al.  Zheng Shao, Yuya Hirayama, Yoshihiro Yamanishi, and Hiroto Saigo. Mining discriminative patterns from graph data with multiple labels and its application to quantitative structure–activity relationship (qsar) models. Journal of chemical information and modeling, 55(12):2519–2527, 2015.
- Wang et al.  Xiaofeng Wang, Zhen Li, Mingjian Jiang, Shuang Wang, Shugang Zhang, and Zhiqiang Wei. Molecule property prediction based on spatial graph embedding. Journal of chemical information and modeling, 59(9):3817–3828, 2019.
- Liu et al.  Ke Liu, Xiangyan Sun, Lei Jia, Jun Ma, Haoming Xing, Junqiu Wu, Hua Gao, Yax Sun, Florian Boulnois, and Jie Fan. Chemi-net: a molecular graph convolutional network for accurate drug property prediction. International journal of molecular sciences, 20(14):3389, 2019.
- Goulon et al.  A Goulon, T Picot, A Duprat, and G Dreyfus. Predicting activities without computing descriptors: graph machines for qsar. SAR and QSAR in Environmental Research, 18(1-2):141–153, 2007.
- Tang et al.  Bowen Tang, Skyler T Kramer, Meijuan Fang, Yingkun Qiu, Zhen Wu, and Dong Xu. A self-attention based message passing neural network for predicting molecular lipophilicity and aqueous solubility. Journal of cheminformatics, 12(1):1–9, 2020.
- Cherkasov et al.  Artem Cherkasov, Eugene N Muratov, Denis Fourches, Alexandre Varnek, Igor I Baskin, Mark Cronin, John Dearden, Paola Gramatica, Yvonne C Martin, Roberto Todeschini, et al. Qsar modeling: where have you been? where are you going to? Journal of medicinal chemistry, 57(12):4977–5010, 2014.
- Chen et al.  Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. The rise of deep learning in drug discovery. Drug discovery today, 23(6):1241–1250, 2018.
- Matveieva and Polishchuk  Mariia Matveieva and Pavel Polishchuk. Benchmarks for interpretation of qsar models. Journal of cheminformatics, 13(1):1–20, 2021.
- Hamilton et al.  William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.
- Xu et al.  Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
- Morris et al.  Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4602–4609, 2019.
- Ranjan et al.  Ekagra Ranjan, Soumya Sanyal, and Partha Talukdar. Asap: Adaptive structure aware pooling for learning hierarchical graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5470–5477, 2020.
- Pal  Mahesh Pal. Random forest classifier for remote sensing classification. International journal of remote sensing, 26(1):217–222, 2005.
- Noble  William S Noble. What is a support vector machine? Nature biotechnology, 24(12):1565–1567, 2006.
- Weininger  David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Brown et al.  Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Raffel et al.  Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- Parmar et al.  Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR, 2018.
- Carion et al.  Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- Tay et al.  Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020.
- Landrum et al.  Greg Landrum et al. Rdkit: A software suite for cheminformatics, computational chemistry, and predictive modeling, 2013.
- Ramsundar et al.  Bharath Ramsundar, Peter Eastman, Patrick Walters, and Vijay Pande. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O’Reilly Media, 2019.
- Yang et al.  Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8):3370–3388, 2019.
- Ba et al.  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Veličković et al.  Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
- Kipf and Welling  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Gardner and Dorling  Matt W Gardner and SR Dorling. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment, 32(14-15):2627–2636, 1998.
- Kiwiel  Krzysztof C Kiwiel. Convergence and efficiency of subgradient methods for quasiconvex minimization. Mathematical programming, 90(1):1–25, 2001.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Zhuang et al.  Juntang Zhuang, Tommy Tang, Sekhar Tatikonda, Nicha Dvornek, Yifan Ding, Xenophon Papademetris, and James S Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. arXiv preprint arXiv:2010.07468, 2020.
- Sushko et al.  Iurii Sushko, Sergii Novotarskyi, Robert Körner, Anil Kumar Pandey, Matthias Rupp, Wolfram Teetz, Stefan Brandmaier, Ahmed Abdelaziz, Volodymyr V Prokopenko, Vsevolod Y Tanchuk, et al. Online chemical modeling environment (ochem): web platform for data storage, model development and publishing of chemical information. Journal of computer-aided molecular design, 25(6):533–554, 2011.
- van Breemen and Li  Richard B van Breemen and Yongmei Li. Caco-2 cell permeability assays to measure drug absorption. Expert opinion on drug metabolism & toxicology, 1(2):175–185, 2005.
- Currano and Roth  Judith Currano and Dana Roth. Chemical information for chemists: a primer. Royal Society of Chemistry, 2014.
- Mortelmans and Zeiger  Kristien Mortelmans and Errol Zeiger. The ames salmonella/microsome mutagenicity assay. Mutation research/fundamental and molecular mechanisms of mutagenesis, 455(1-2):29–60, 2000.