
Molecule Attention Transformer

by Łukasz Maziarka et al.

Designing a single neural network architecture that performs competitively across a range of molecule property prediction tasks remains largely an open challenge, and its solution may unlock a widespread use of deep learning in the drug discovery industry. To move towards this goal, we propose Molecule Attention Transformer (MAT). Our key innovation is to augment the attention mechanism in Transformer using inter-atomic distances and the molecular graph structure. Experiments show that MAT performs competitively on a diverse set of molecular prediction tasks. Most importantly, with a simple self-supervised pretraining, MAT requires tuning of only a few hyperparameter values to achieve state-of-the-art performance on downstream tasks. Finally, we show that attention weights learned by MAT are interpretable from the chemical point of view.



1 Introduction

The task of predicting properties of a molecule lies at the center of applications such as drug discovery and material design. In particular, a large fraction of drug candidates is estimated to fail clinical trials in the United States after a long and costly development process (Wong et al., 2018). Potentially, many of these failures could have been avoided by correctly predicting a clinically relevant property of the molecule, such as its toxicity or bioactivity.

Following the breakthroughs in image (Krizhevsky et al., 2012) and text classification (Vaswani et al., 2017), deep neural networks (DNNs) are expected to revolutionize other fields such as drug discovery and material design (Jr et al., 2019). However, on many molecular property prediction tasks DNNs are outperformed by shallow models such as support vector machines and random forests (Korotcov et al., 2017; Wu et al., 2018). On the other hand, while DNNs can outperform shallow models on some tasks, they tend to be difficult to train (Ishiguro et al., 2019; Hu et al., 2019) and can require tuning of a large number of hyperparameters. We observe both issues on our benchmark (see Section 4.2).

Making deep networks easier to train has been the central force behind their widespread use. In particular, one of the most important breakthroughs in deep learning was the development of initialization methods that made it possible to easily train deep networks end-to-end (Goodfellow et al., 2016). In a similar spirit, our aim is to develop a deep model that is simple to use out-of-the-box and achieves strong performance on a wide range of molecule property prediction tasks.

In this paper we propose the Molecule Attention Transformer (MAT). We adapt Transformer (1) to chemical molecules by augmenting the self-attention with inter-atomic distances and molecular graph structure. Figure 1 shows the architecture. We demonstrate that MAT, in contrast to other tested models, achieves strong performance across a wide range of tasks (see Figure 2). Next, we show that self-supervised pre-training further improves performance, while drastically reducing the time needed for hyperparameter tuning (see Table 3). In these experiments we tuned only the learning rate, testing different values. Finally, we find that MAT has interpretable attention weights. We share pretrained weights at

2 Related work

Figure 1: Molecule Attention Transformer architecture. We largely base our model on the Transformer encoder. In the first layer we embed each atom using one-hot encoding and atomic features. The main innovation is the Molecule Multi-Head Self-Attention layer, which augments attention with the distance and graph structure of the molecule. We implement this using a weighted (by $\lambda_a$, $\lambda_d$, and $\lambda_g$) element-wise sum of the corresponding matrices.

Molecule property prediction.

Predicting the properties of a candidate molecule lies at the heart of many fields such as drug discovery and material design. Broadly speaking, there are two main approaches to predicting molecular properties. First, we can use our knowledge of the underlying physics (Lipinski et al., 1997). However, despite recent advances (Schütt et al., 2017), current approaches remain prohibitively costly for accurately predicting many properties of interest, such as bioactivity. The second approach is to train a predictive model on existing data (Haghighatlari and Hachmann, 2019). Here the key issue is the lack of large datasets: even for the most popular drug targets, such as 5-HT1A (a popular target for depression), only thousands of active compounds are known. A promising direction is hybrid approaches, such as Wallach et al. (2015), or approaches that leverage domain knowledge and the underlying physics to impose a strong prior, such as Feinberg et al. (2018).

Deep learning for molecule property prediction.

Deep learning has become a valuable tool for modeling molecules. Over the years, the community has progressed from handcrafted representations, to representing molecules as strings of symbols, and finally to the currently popular approaches based on molecular graphs.

Graph convolutional networks (GCNs) gather, in each subsequent layer, information from adjacent nodes in the graph; after k convolutional layers, each node holds information from neighbors up to k edges away. Using the graph structure improves performance on a range of molecule modeling tasks (Wu et al., 2018). Some of the most recent works implement more sophisticated methods for aggregating neighbor information: Veličković et al. (2017) and Shang et al. (2018) propose to augment GCNs with an attention mechanism, and Li et al. (2018) introduce a model that dynamically learns the neighbourhood function in the graph.

In parallel to these advances, using the three-dimensional structure of the molecule is becoming increasingly popular. Perhaps the most closely related models are 3D Graph Convolutional Neural Network (3DGCN), Message Passing Neural Network (MPNN), and Adaptive Graph Convolutional Network (AGCN) (Cho and Choi, 2018; Gilmer et al., 2017; Li et al., 2018). 3DGCN and MPNN integrate graph and distance information in a single model, which enables them to achieve strong performance on tasks such as solubility prediction. In contrast to them, we additionally allow for a flexible neighbourhood based on self-attention.

Transformer, originally developed for natural language processing (Vaswani et al., 2017), has recently been applied to retrosynthesis by Karpov et al. (2019), who represent compounds as sentences using the SMILES notation (Weininger, 1988). In contrast, we represent compounds as lists of atoms and ensure that the model understands the structure of the molecule by augmenting the self-attention mechanism (see Figure 1). Our ablation studies show this is a critical component of the model.

To summarize, methods related to our model have been proposed in the literature. Our contribution is unifying these ideas in a single model based on the state-of-the-art Transformer architecture that preserves strong performance across many chemical tasks.

How easy is it to use deep learning for molecule property prediction?

The performance of DNNs is not always competitive with methods such as support vector machines or random forests. MoleculeNet, a popular benchmark for molecule property prediction methods (Wu et al., 2018), demonstrates this phenomenon; similar results can be found in Withnall et al. (2019). We reproduce this issue on our benchmark. It makes deep learning less attractive for molecule property prediction, because in some cases practitioners may be better served by other methods. Another issue is that graph neural networks, the most popular class of models for molecule property prediction, can be difficult to train. Ishiguro et al. (2019) show, and try to address, the tendency of graph neural networks to underfit the training set. We also reproduce this issue on our benchmark (see also App. C).

There has been considerable interest in developing easier-to-use deep models for molecule property prediction. Goh et al. (2017) pretrain a deep network that takes an image of a molecule as input. Other studies highlight the need to augment feedforward networks (Mayr et al., 2018) and graph neural networks (Yang et al., 2019) with handcrafted representations of molecules. Hu et al. (2019) propose pretraining methods for graph neural networks and show that they largely alleviate the problem of underfitting present in these architectures (Ishiguro et al., 2019). We take inspiration from Hu et al. (2019) and use one of the three pretraining tasks proposed therein.

Concurrently, Wang et al. (2019) and Honda et al. (2019) pretrain a vanilla Transformer (1) that takes as input a text representation (SMILES) of a molecule. Honda et al. (2019) show that a decoding-based approach improves the data efficiency of the model. A similar approach, specialized to the task of drug-target interaction prediction, was concurrently proposed in Shin et al. (2019). In contrast to them, we adapt Transformer to chemical structures, which in our opinion is crucial for achieving strong empirical performance, and we use a domain-specific pretraining based on Hu et al. (2019). We further confirm the importance of both choices by comparing directly with Honda et al. (2019).

Self-attention based models.

Arguably, the attention mechanism (Bahdanau et al., 2014) has been one of the most important breakthroughs in deep learning. This is perhaps best illustrated by the widespread use of the Transformer architecture in natural language processing (Vaswani et al., 2017; 1).

Multiple prior works have augmented self-attention in Transformer using domain-specific knowledge (Chen et al., 2018; Shaw et al., 2018; Bello et al., 2019; Guo et al., 2019). Guo et al. (2019) encourage Transformer to attend to adjacent words in a sentence, and Chen et al. (2018) encourage another attention-based model to focus on pairs of words in a sentence that are connected in an external knowledge base. Our novelty is applying this modeling idea to molecule property prediction.

3 Molecule Attention Transformer

As the rich literature on deep learning for molecule property prediction suggests, a model needs to be flexible enough to represent a range of possible relationships between the atoms of a compound. Inspired by its flexibility and strong empirical performance, we base our model on the Transformer encoder (Vaswani et al., 2017; 1). It is worth noting that natural language processing has inspired important advances in cheminformatics (Segler et al., 2017; Gómez-Bombarelli et al., 2018), which might be due to similarities between the two domains (Jastrzębski et al., 2016).


We begin by briefly introducing the Transformer architecture. On a high level, a Transformer for classification consists of a stack of attention blocks followed by a pooling and a classification layer. Each attention block is composed of a multi-head self-attention layer, followed by a feed-forward block that includes a residual connection and layer normalization.

The multi-head self-attention layer is composed of several heads. Head $i$ takes as input the hidden state $H$ and first computes the queries, keys, and values as $Q_i = H W_i^Q$, $K_i = H W_i^K$, and $V_i = H W_i^V$. These are used in the attention operation as follows:

$$\mathcal{A}^{(i)} = \operatorname{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i. \qquad (1)$$


Molecule Self-Attention.

Using a naive Transformer architecture would require encoding chemical molecules as sentences. Instead, inspired by Battaglia et al. (2018), we interpret the self-attention as a soft adjacency matrix between the elements of the input sequence. Following this line of thought, it is natural to augment the self-attention using information about the actual structure of the molecule. This allows us to avoid a linearized (textual) representation of the molecule as input (Jastrzębski et al., 2016), which we expect to provide a better inductive bias for the model.

More concretely, we propose the Molecule Self-Attention layer, which we describe in Equation 2. We augment the self-attention matrix as follows: let $A$ denote the graph adjacency matrix and $D$ denote the inter-atomic distance matrix. Let $\lambda_a$, $\lambda_d$, and $\lambda_g$ denote scalars weighting the self-attention, distance, and adjacency matrices. We modify Equation 1 as follows:

$$\mathcal{A}^{(i)} = \left(\lambda_a \operatorname{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) + \lambda_d g(D) + \lambda_g A\right) V_i, \qquad (2)$$

see also Figure 1. We denote $\lambda_a$, $\lambda_d$, and $\lambda_g$ jointly as $\lambda$. We use $g$ as either a softmax (normalized over the rows) or the element-wise $g(d) = \exp(-d)$. Finally, the distance matrix $D$ is computed using the RDKit package (Landrum, 2016).
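As a concrete illustration, the modified attention above can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not the authors' implementation: the weight matrices, the toy adjacency and distance matrices, and the $\lambda$ values are made up for illustration, and $g$ is taken as the element-wise $\exp(-d)$ variant.

```python
import numpy as np

def row_softmax(x):
    """Softmax normalized over the rows of a matrix."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def molecule_self_attention(H, Wq, Wk, Wv, A, D, lambdas, d_k):
    """One head of Molecule Self-Attention in the spirit of Equation 2.

    H: (n_atoms, d_model) hidden states, A: graph adjacency matrix,
    D: inter-atomic distance matrix, lambdas = (lam_a, lam_d, lam_g).
    """
    lam_a, lam_d, lam_g = lambdas
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    att = row_softmax(Q @ K.T / np.sqrt(d_k))      # standard self-attention term
    g_D = np.exp(-D)                               # element-wise g(d) = exp(-d)
    mixed = lam_a * att + lam_d * g_D + lam_g * A  # weighted element-wise sum
    return mixed @ V

# Toy usage on a 3-atom "molecule" with made-up weights and geometry
rng = np.random.default_rng(0)
n, d = 3, 4
H = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
D = np.array([[0., 1.5, 2.9], [1.5, 0., 1.4], [2.9, 1.4, 0.]])
out = molecule_self_attention(H, Wq, Wk, Wv, A, D, (0.3, 0.3, 0.4), d)
```

Setting $\lambda = (1, 0, 0)$ recovers the standard attention of Equation 1, which makes the relationship between the two layers easy to check.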

Note that while we use only the adjacency and the distance matrices, MAT can be easily extended to include other types of information, e.g. forces between the atoms.

Molecule Attention Transformer.

To define the model, we replace all self-attention layers in the original Transformer encoder with our Molecule Self-Attention layers. We embed each atom as a vector of atomic features following Coley et al. (2017), shown in Table 1. In the experiments, we treat $\lambda_a$, $\lambda_d$, and $\lambda_g$ as hyperparameters and keep them frozen during training. Figure 1 illustrates the model.


We experiment with one of the two node-level pretraining tasks proposed in Hu et al. (2019), which involves predicting the masked input nodes. Consistently with Hu et al. (2019), we found it stabilizes learning (see Figure 6) and reduces the need for an extensive hyperparameter search (see Table 3). Given that MAT already achieves good performance using this simple pretraining task, we leave for future work exploring the other tasks proposed in Hu et al. (2019).
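A minimal sketch of such a node-level masking task: a fraction of atoms has its input features hidden and kept as prediction targets. The masking rate and the use of zeros as the mask embedding are illustrative assumptions, not taken from Hu et al. (2019).

```python
import numpy as np

MASK_FRACTION = 0.15  # assumed masking rate (BERT-style); not taken from the paper

def mask_atoms(features, rng):
    """Prepare a node-level pretraining example.

    features: (n_atoms, n_feat) input featurization. Returns the masked
    features, the masked indices, and the original features at those
    positions, which serve as prediction targets.
    """
    n = features.shape[0]
    n_mask = max(1, int(round(MASK_FRACTION * n)))
    idx = rng.choice(n, size=n_mask, replace=False)
    masked = features.copy()
    masked[idx] = 0.0  # zeroing stands in for a dedicated mask embedding
    return masked, idx, features[idx]
```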

Other details.

Inspired by Li et al. (2017) and Clark et al. (2019), we add an artificial dummy node to the molecule. The dummy node is not connected by an edge to any other atom, and its distance to every atom is set to a large constant. Our motivation is to allow the model to skip searching for a molecular pattern when there is none to find, by putting higher attention on this distant node, similar to how BERT uses the separation token (1; Clark et al., 2019). We confirm this intuition in Section 4.4 and Section 4.5.
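Adding the dummy node amounts to padding the adjacency, distance, and feature matrices by one row and column. A sketch, where DUMMY_DISTANCE is an assumed large constant (the exact value used in the paper is not reproduced here):

```python
import numpy as np

DUMMY_DISTANCE = 1e6  # assumed large constant, standing in for the paper's value

def add_dummy_node(A, D, X, dummy_features):
    """Append a dummy node that is disconnected in the graph and distant in space.

    A: (n, n) adjacency, D: (n, n) distances, X: (n, f) atom features.
    """
    n = A.shape[0]
    A2 = np.zeros((n + 1, n + 1))
    A2[:n, :n] = A                       # no edges to or from the dummy node
    D2 = np.full((n + 1, n + 1), DUMMY_DISTANCE)
    D2[:n, :n] = D
    D2[n, n] = 0.0
    X2 = np.vstack([X, dummy_features])  # dummy gets its own feature vector
    return A2, D2, X2
```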

Finally, the distance matrices are computed from 3D conformers generated with the UFFOptimizeMolecule function from the RDKit package (Landrum, 2016), using its default parameters (maxIters=, vdwThresh=, confId=, ignoreInterfragInteractions=True). For each compound we use one pre-computed conformation. We experimented with sampling more conformations per compound but did not observe a consistent boost in performance; it is possible, however, that more sophisticated algorithms for 3D structure minimization could improve the results. We leave this for future work.
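Given conformer coordinates, the inter-atomic distance matrix is just a pairwise Euclidean distance computation. A NumPy sketch (conformer generation with RDKit is omitted; the coordinates below are made up):

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between atoms.

    coords: (n_atoms, 3) conformer coordinates, e.g. taken from a
    UFF-optimized RDKit conformer (conformer generation is omitted here).
    """
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Usage: three collinear atoms spaced 1.0 apart along the x axis
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
D = distance_matrix(coords)
```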

Description
Atomic identity as a one-hot vector of {B, N, C, O, F, P, S, Cl, Br, I, Dummy, other}
Number of heavy neighbors as a one-hot vector of {0, 1, 2, 3, 4, 5}
Number of hydrogen atoms as a one-hot vector of {0, 1, 2, 3, 4}
Formal charge
Is in a ring
Is aromatic
Table 1: Featurization used to embed atoms in MAT.
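The featurization of Table 1 can be sketched as a concatenation of one-hot encodings and three scalar flags. This is an illustrative reimplementation; in practice the atom properties would be read from RDKit rather than passed as plain arguments:

```python
def one_hot(value, choices):
    """One-hot encode value over choices; values outside choices map to all zeros."""
    return [1.0 if value == c else 0.0 for c in choices]

ATOMS = ["B", "N", "C", "O", "F", "P", "S", "Cl", "Br", "I", "Dummy", "other"]

def featurize_atom(symbol, n_heavy, n_hydrogens, formal_charge, in_ring, aromatic):
    """Build the Table 1 feature vector (length 12 + 6 + 5 + 3 = 26) for one atom."""
    sym = symbol if symbol in ATOMS else "other"
    return (
        one_hot(sym, ATOMS)
        + one_hot(n_heavy, [0, 1, 2, 3, 4, 5])
        + one_hot(n_hydrogens, [0, 1, 2, 3, 4])
        + [float(formal_charge), float(in_ring), float(aromatic)]
    )

# An aromatic ring carbon with two heavy neighbours and one hydrogen
vec = featurize_atom("C", 2, 1, 0, True, True)
```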

4 Experiments

We begin by comparing MAT to other popular models in the literature on a wide range of tasks. We find that with simple pretraining MAT outperforms other methods, while using a small budget for hyperparameter tuning.

In the rest of this section we try to develop understanding of what makes MAT work well. In particular, we find that individual heads in the multi-headed self-attention layers learn chemically interpretable functions.

4.1 Experimental settings

Comparing different models for molecule property prediction is challenging. Despite considerable efforts, the community still lacks a standardized way to compare different models. In our work, we use a similar setting to MoleculeNet (Wu et al., 2018).


Following the recommendations of Wu et al. (2018) and the experimental setup of Podlewska and Kafel (2018), we use a random split for FreeSolv, ESOL, and MetStab. For all the other datasets we use a scaffold split, which assigns compounds that share the same molecular scaffold to the same subset of the data (Bemis and Murcko, 1996). In regression tasks, the property value was standardized. Test performance is based on the model that gave the best results on the validation set. Each training was repeated 6 times, on different train/validation/test splits. All the other experimental details are reported in the Supplement.
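A scaffold split can be sketched as grouping compounds by their Bemis-Murcko scaffold and assigning whole groups to subsets, so that no scaffold is shared between train and test. The greedy largest-group-first ordering and the fraction defaults below are illustrative assumptions (computing the scaffolds themselves, e.g. with RDKit, is omitted):

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: compounds sharing a scaffold stay in one subset.

    scaffolds: dict mapping compound id -> Bemis-Murcko scaffold SMILES.
    """
    groups = defaultdict(list)
    for cid, scaffold in scaffolds.items():
        groups[scaffold].append(cid)
    # Place the largest scaffold groups first
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```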


We run experiments on a wide range of datasets that represent typical tasks encountered in molecule modeling. Below, we include a short description of these tasks; a more detailed description is deferred to App. A.

  • FreeSolv, ESOL. Regression tasks used in Wu et al. (2018) for predicting water solubility in terms of the hydration free energy (FreeSolv) and log solubility in mols per litre (ESOL).

  • Blood-brain barrier permeability (BBBP). A binary classification task used in Wu et al. (2018) for predicting the ability of a molecule to penetrate the blood-brain barrier.

  • Estrogen Alpha, Estrogen Beta. The tasks are to predict whether a compound is active towards a given target (Estrogen-α, Estrogen-β) based on experimental data from the ChEMBL database (Gaulton et al., 2011).

  • MetStabhigh, MetStablow. Binary classification tasks based on data from Podlewska and Kafel (2018) to predict whether a compound has high (over 2.32 h half-time) or low (lower than 0.6 h half-time) metabolic stability. Both datasets contain the same molecules.

4.2 Molecule Attention Transformer

(a) Hyperparameter search budget of 500 combinations.
(b) Hyperparameter search budget of combinations.
Figure 2: The average rank across the 7 datasets in the benchmark, under the two hyperparameter search budgets. We split the data using a random or scaffold split (according to the dataset description) 6 times into train/validation/test folds and use the mean metrics across the test folds to obtain the rankings of models. Interestingly, shallow models (RF and SVM) outperform graph models (GCN, EAGCN, and Weave).
BBBP ESOL FreeSolv Estrogen-α Estrogen-β MetStab_low MetStab_high
SVM .707 ± .000 .478 ± .054 .461 ± .077 .973 ± .000 .778 ± .000
RF .725 ± .006 .534 ± .073 .523 ± .097 .977 ± .001 .885 ± .029 .888 ± .030
GCN .712 ± .010 .357 ± .032 .271 ± .048 .975 ± .003 .730 ± .006 .881 ± .031 .875 ± .036
Weave .701 ± .016 .311 ± .023 .311 ± .072 .974 ± .003 .769 ± .023 .863 ± .028 .882 ± .043
EAGCN .680 ± .014 .316 ± .024 .345 ± .051 .961 ± .011 .781 ± .012 .883 ± .024 .868 ± .034
MAT (ours) .765 ± .007 .862 ± .038 .888 ± .027
(a) Hyperparameter search budget of combinations.
BBBP ESOL FreeSolv Estrogen-α Estrogen-β MetStab_low MetStab_high
SVM .723 ± .000 .479 ± .055 .461 ± .077 .973 ± .000 .772 ± .000
RF .721 ± .003 .534 ± .073 .524 ± .098 .977 ± .001 .892 ± .026 .888 ± .030
GCN .695 ± .013 .369 ± .032 .299 ± .068 .975 ± .003 .730 ± .006 .884 ± .033 .875 ± .036
Weave .702 ± .009 .298 ± .025 .298 ± .049 .974 ± .003 .769 ± .023 .863 ± .028 .885 ± .042
EAGCN .680 ± .014 .322 ± .052 .337 ± .042 .961 ± .011 .781 ± .012 .859 ± .024 .844 ± .037
MAT (ours) .765 ± .007 .861 ± .029 .844 ± .052
(b) Hyperparameter search budget of combinations.
Table 2: Test performances in the benchmark, under the two hyperparameter search budgets (top and bottom). On ESOL and FreeSolv we report RMSE (lower is better). The other tasks are evaluated using ROC AUC (higher is better). Experiments are repeated 6 times.
Figure 3: The average ranks across the 7 datasets in the benchmark. Pretrained MAT outperforms the other methods, despite a drastically smaller number of tested hyperparameter combinations compared to MAT and EAGCN.


Similarly to Wu et al. (2018), we test a comprehensive set of baselines spanning both shallow and deep models. We compare MAT to GCN (Duvenaud et al., 2015), Random Forest (RF), and Support Vector Machine with an RBF kernel (SVM). We also test two recently proposed models: the Edge Attention-based Multi-Relational Graph Convolutional Network (EAGCN) (Shang et al., 2018) and Weave (Kearnes et al., 2016).

Hyperparameter tuning.

For each method we extensively tune its hyperparameters using random search (Bergstra and Bengio, 2012). To ensure a fair comparison, each model is given the same budget for hyperparameter search, and we run two sets of experiments with different evaluation budgets. We include the hyperparameter ranges in App. B.
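Random search of this kind can be sketched as repeatedly sampling a configuration from the search space and keeping the one with the best validation score. The search space below is hypothetical, not the grid from App. B:

```python
import random

# Hypothetical search space; the actual ranges are listed in App. B
SPACE = {
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3],
    "batch_size": [16, 32, 64],
    "num_layers": [2, 4, 6, 8],
    "dropout": [0.0, 0.1, 0.2],
}

def random_search(evaluate, budget, rng):
    """Sample `budget` configurations uniformly and keep the best-scoring one."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = {name: rng.choice(values) for name, values in SPACE.items()}
        score = evaluate(cfg)  # e.g. validation ROC AUC of the trained model
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```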


We evaluate models by their average rank according to test set performance on the datasets. Figure 2 reports the ranks of all methods for the two considered hyperparameter budgets, and Table 2 reports detailed scores on all datasets. We make three main observations.

First, graph neural networks (GCN, Weave, EAGCN) on average do not outperform the other models; the best graph model achieves a worse average rank than RF. On the whole, performance of the deep models improves with a larger hyperparameter search budget. This further corroborates the original motivation of our study: using common deep learning methods for molecule property prediction is challenging in practice, as it requires a large computational budget and might still result in poor performance.

Second, MAT outperforms the other tested methods in terms of the average rank under both budgets, with RF being the second best performing model. This shows that the MAT architecture is flexible enough, and has the right inductive bias, to perform well on a wide range of tasks.

Examining the performance of MAT across individual datasets, we observe that RF and SVM perform better on Estrogen-β, MetStab_low, and MetStab_high. Both RF and SVM use extended-connectivity fingerprints (Rogers and Hahn, 2010) as input representation, which encode substructures of the molecule as features. Metabolic stability of a compound depends on the existence of particular moieties, which are recognized by enzymes; therefore simple structure-based fingerprints perform well in this setting. Wang et al. (2019) and Mayr et al. (2018) show that using fingerprints as input representation improves the performance of deep networks on related datasets. These two observations suggest that MAT could benefit from using fingerprints. Instead, we avoid handcrafted representations and investigate pretraining as an alternative in the next section. Though fingerprint-based models show excellent performance on the presented tasks, there are datasets on which they fail to match the performance of graph approaches; we observed this also on an energy prediction task (see the extension of our benchmark in App. C).

4.3 Pretrained Molecule Attention Transformer

Self-supervised pretraining has revolutionized natural language processing (1) and has improved performance in molecule property prediction (Hu et al., 2019). We apply here node-level self-supervised pretraining from Hu et al. (2019) to MAT. The task is to predict features of masked out nodes. We refer the reader to App. D for more details.


We compare MAT to the two following baselines. First, we apply the same pretraining to EAGCN, which we will refer to as “Pretrained EAGCN”. Second, we compare to a concurrent work by Honda et al. (2019). They pretrain a vanilla Transformer by decoding textual representation (SMILES) of molecules. We will refer to their method as “SMILES Transformer”.


For all methods that use pretraining we reduce the hyperparameter grid to a minimum and tune only the learning rate. We set the other hyperparameters to reasonable defaults based on the results from Section 4.2. For MAT and EAGCN, we follow (1) and use the largest model that still fits in GPU memory. For SMILES Transformer we use the pretrained weights provided by Honda et al. (2019).


As in previous section, we compare the models based on their average rank on our benchmark. Figure  3 and Table 3 summarize the results.

We observe that Pretrained MAT achieves the best average rank and outperforms MAT. Importantly, for Pretrained MAT we only tuned the learning rate, evaluating a handful of values, in stark contrast to the hundreds of hyperparameter combinations tested for MAT and EAGCN. To visualize this, in Figure 4 we plot the average test performance of all models as a function of the number of tested hyperparameter combinations. We also note that Pretrained MAT is more competitive on the three datasets mentioned in the previous section.

We also find that Pretrained MAT outperforms the other two pretrained methods: pretraining degrades the performance of EAGCN, and SMILES Transformer achieves the worst average rank. This suggests that both the architecture and the choice of pretraining task are important for the overall performance of the model.

BBBP ESOL FreeSolv Estrogen-α Estrogen-β MetStab_low MetStab_high
EAGCN .687 ± .023 .323 ± .031 1.244 ± .341 .994 ± .002 .770 ± .010 .861 ± .029 .839 ± .038
SMILES .717 ± .008 .356 ± .017 .393 ± .032 .953 ± .002 .757 ± .002 .860 ± .038 .881 ± .036
Table 3: Test set performances of methods that use pretraining. Experiments are repeated 6 times. SMILES refers to the SMILES Transformer from Honda et al. (2019).
(a) Regression tasks.
(b) Classification tasks.
Figure 4: Test performance of all models as a function of the number of tested hyperparameter combinations (on a logarithmic scale). The figures show the aggregated mean RMSE for regression tasks (left) and the aggregated mean ROC AUC for classification tasks (right). Pretrained MAT requires tuning an order of magnitude fewer hyperparameter combinations, and performs competitively on both sets of tasks.

4.4 Ablation studies

To better understand what contributes to the performance of MAT, we run a series of ablation studies on three representative datasets from our benchmark. We leave understanding how these choices interact with pretraining for future work.

For the experiments in this section we generated additional splits for the ESOL, FreeSolv, and BBBP datasets (different from those in Section 4.2). For each configuration we select the best hyperparameter settings using random search under a fixed budget of evaluations. Each experiment is repeated across these splits.

Dummy node is not so dummy.

MAT .250 ± .007
- Dummy .714 ± .010 .317 ± .014
Table 4: Test performance of the MAT variant without the dummy node (- Dummy) compared to the original MAT.

MAT uses a dummy node that is disconnected from the other atoms in the graph (Li et al., 2017). Our intuition is that such functionality can be useful for automatically adapting capacity on small datasets: by attending to the dummy node, the model can effectively choose to avoid changing the internal representation in a given layer. To examine this architectural choice, in Table 4 we compare MAT to a variant that does not include the dummy node. The results show that the dummy node improves the performance of the model.

Knowing molecular graph and distances between atoms improves performance.

Our key architectural innovation is integrating the molecular graph and inter-atomic distances with the self-attention layer in Transformer, as shown in Figure 1. To probe the importance of each of these sources of information, we removed them individually during training. The results in Table 5 suggest that keeping all sources of information gives the most stable performance across the three tasks, which is our primary goal. We also show that MAT can effectively use distance information in a toy task involving 3-dimensional distances between functional groups (see App. F).

MAT .723 ± .008 .286 ± .006
- graph .716 ± .009 .316 ± .036 .276 ± .034
- distance .281 ± .013
- attention .692 ± .001 .306 ± .026 .329 ± .014
Table 5: Test performance of MAT with different sources of information removed (equivalent to setting the corresponding $\lambda$ to zero).

Using a more complex featurization does not improve performance.

Many models for predicting molecule properties use additional edge features (Coley et al., 2017; Shang et al., 2018; Gilmer et al., 2017). In Table 6 we show that adding such edge features does not improve MAT performance. It is certainly possible that a more comprehensive set of edge features, or a better method for integrating them, would improve performance; we leave this for future work. The procedure for using edge features is described in detail in App. E.

+ Edges f. .683 ± .008
Table 6: Test performance of MAT using additional edge features (+ Edges f.), compared to vanilla MAT.

4.5 Analysis.

To understand MAT better, we investigate attention weights of the model, and the effect of pretraining on the learning dynamics.

What is MAT looking at?

In natural language processing, it has been shown that heads in Transformer seem to implement interpretable functions (Htut et al., 2019; Clark et al., 2019). Similarly, we investigate here the chemical function implemented by self-attention heads in MAT. We show patterns found in the model that was pretrained with the atom masking strategy (Hu et al., 2019), and then we verify our findings on a set of molecules extracted from the BBBP testing dataset.

Based on a manual inspection of attention matrices of MAT, we find two broad patterns: (1) many attention heads are almost fully focused on the dummy node, (2) many attention heads focus only on a few atoms. This seems consistent with observations about Transformer in Clark et al. (2019). We also notice that initial self-attention layers learn simple and easily interpretable chemical patterns, while subsequent layers capture more complex arrangements of atoms. In Figure 5 we exemplify attention patterns on a random molecule from the BBBP dataset.

To quantify the above findings, we select six heads from the first layer that fit the second category and seem to implement six patterns: (i) focuses on 2-neighboured (unsubstituted) aromatic carbons; (ii) focuses on sulfurs; (iii) focuses on non-ring nitrogens; (iv) focuses on oxygens in carbonyl groups; (v) focuses on 3-neighboured aromatic atoms (positions of aromatic ring substitutions) and, for other atoms, on sulfur; (vi) focuses on nitrogens in aromatic rings. We found that on the BBBP test set the atoms matching these definitions (queried with SMARTS expressions) indeed receive higher attention weights than other atoms. For each head, we calculated attention weights for all atoms in all molecules and compared those matching our hypothesis against the other atoms. Their distributions differ significantly (Kruskal-Wallis test) for all the patterns. The statistics and experimental details are summarized in the Appendix.


Figure 5: The heatmaps show selected self-attention weights from the first layer of MAT, on a random molecule from the BBBP dataset (center). The atoms, which these heads focus on, are marked with the same color as the corresponding matrix. The interpretation of the presented patterns is described in the text.

Effect of pretraining.

Wu et al. (2018) observed that pretraining stabilizes and speeds up training of graph convolutional models. We observe a similar effect in our case. Figure 6 reports the training error of MAT and Pretrained MAT on the ESOL (left) and FreeSolv (right) datasets. We use the learning rate that achieved the best generalization on each dataset in Sec. 4.3, and repeat each experiment several times. On both datasets, Pretrained MAT converges faster and has a lower variance of training error across repetitions; its mean standard deviation of training error is lower than that of MAT on both ESOL and FreeSolv.

(a) ESOL
(b) FreeSolv
Figure 6: Training of MAT with (blue) and without (orange) pretraining, on ESOL (left) and FreeSolv (right). Pretraining stabilizes training (smaller variance of the training error) and improves convergence speed.

5 Conclusions.

In this work we propose Molecule Attention Transformer as a versatile architecture for molecular property prediction. In contrast to other tested models, MAT performs well across a wide range of molecule property prediction tasks. Moreover, inclusion of self-supervised pretraining further improves its performance, and drastically reduces the need for tuning of hyperparameters.

We hope that our work will widen adoption of deep learning in applications involving molecular property prediction, as well as inspire new modeling approaches. One particularly promising avenue for future work is exploring better pretraining tasks for MAT.


  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. External Links: 1409.0473 Cited by: §2.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. F. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, Ç. Gülçehre, F. Song, A. J. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu (2018) Relational inductive biases, deep learning, and graph networks. abs/1806.01261. Cited by: §3.
  • I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le (2019) Attention augmented convolutional networks. abs/1904.09925. Cited by: §2.
  • G. W. Bemis and M. A. Murcko (1996) The properties of known drugs. 1. molecular frameworks. 39 (15), pp. 2887–2893. Cited by: §4.1.
  • J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. 13 (1), pp. 281–305. External Links: ISSN 1532-4435 Cited by: §4.2.
  • G. Chen, P. Chen, C. Hsieh, C. Lee, B. Liao, R. Liao, W. Liu, J. Qiu, Q. Sun, J. Tang, et al. (2019) Alchemy: a quantum chemistry dataset for benchmarking ai models. Cited by: Appendix C.
  • Q. Chen, X. Zhu, Z. Ling, D. Inkpen, and S. Wei (2018) Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2406–2417. Cited by: §2.
  • H. Cho and I. S. Choi (2018) Three-dimensionally embedded graph convolutional network (3DGCN) for molecule interpretation. abs/1811.09794. Cited by: §2.
  • K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019) What does bert look at? an analysis of bert’s attention. Cited by: §3, §4.5, §4.5.
  • C. W. Coley, R. Barzilay, W. H. Green, T. S. Jaakkola, and K. F. Jensen (2017) Convolutional embedding of attributed molecular graphs for physical property prediction. 57 (8), pp. 1757–1772. Note: PMID: 28696688 External Links: Document Cited by: §3, §4.4.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2224–2232. Cited by: §4.2.
  • E. N. Feinberg, D. Sur, Z. Wu, B. E. Husic, H. Mai, Y. Li, S. Sun, J. Yang, B. Ramsundar, and V. S. Pande (2018) PotentialNet for molecular property prediction. 4 (11), pp. 1520–1530. External Links: Document Cited by: §2.
  • A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani, and J. P. Overington (2011) ChEMBL: a large-scale bioactivity database for drug discovery. 40 (D1), pp. D1100–D1107. External Links: ISSN 0305-1048, Document, Cited by: 4th item, 3rd item.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1263–1272. Cited by: §2, §4.4.
  • G. Goh, C. Siegel, A. Vishnu, N. Hodas, and N. Baker (2017) Chemception: a deep neural network with minimal chemistry knowledge matches the performance of expert-developed qsar/qspr models. pp. . Cited by: §2.
  • R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik (2018) Automatic chemical design using a data-driven continuous representation of molecules. 4 (2), pp. 268–276. Note: doi: 10.1021/acscentsci.7b00572 External Links: Document, ISBN 2374-7943 Cited by: §3.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §1.
  • M. Guo, Y. Zhang, and T. Liu (2019) Gaussian transformer: a lightweight approach for natural language inference. In AAAI 2019, Cited by: §2.
  • M. Haghighatlari and J. Hachmann (2019) Advances of machine learning in molecular modeling and simulation. 23, pp. 51 – 57. Note: Frontiers of Chemical Engineering: Molecular Modeling External Links: ISSN 2211-3398, Document Cited by: §2.
  • S. Honda, S. Shi, and H. R. Ueda (2019) SMILES transformer: pre-trained molecular fingerprint for low data drug discovery. External Links: 1911.04738 Cited by: Appendix D, §2, §4.3, §4.3, Table 3.
  • P. M. Htut, J. Phang, S. Bordia, and S. R. Bowman (2019) Do attention heads in bert track syntactic dependencies?. External Links: 1911.12246 Cited by: §4.5.
  • W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. S. Pande, and J. Leskovec (2019) Pre-training graph neural networks. abs/1905.12265. External Links: 1905.12265 Cited by: Appendix D, §1, §2, §3, §4.3, §4.5.
  • K. Ishiguro, S. Maeda, and M. Koyama (2019) Graph warp module: an auxiliary module for boosting the power of graph neural networks. abs/1902.01020. External Links: 1902.01020 Cited by: Appendix C, §1, §2, §2.
  • S. Jastrzębski, D. Leśniak, and W. M. Czarnecki (2016) Learning to smile (s). Cited by: §3, §3.
  • J. F. R. Jr, L. Florea, M. C. F. de Oliveira, D. Diamond, and O. N. O. Jr (2019) A survey on big data and machine learning for chemistry. External Links: 1904.10370 Cited by: §1.
  • P. Karpov, G. Godin, and I. V. Tetko (2019) A transformer model for retrosynthesis. In International Conference on Artificial Neural Networks, pp. 817–830. Cited by: §2.
  • S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley (2016) Molecular graph convolutions: moving beyond fingerprints. 30, pp. . External Links: Document Cited by: §4.2.
  • S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B. A. Shoemaker, P. A. Thiessen, B. Yu, et al. (2018) PubChem 2019 update: improved access to chemical data. 47 (D1), pp. D1102–D1109. Cited by: Appendix F.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Cited by: Appendix B, Appendix D.
  • A. Korotcov, V. Tkachenko, D. P. Russo, and S. Ekins (2017) Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery data sets. 14 (12), pp. 4462–4475. Note: doi: 10.1021/acs.molpharmaceut.7b00578 External Links: Document, ISBN 1543-8384 Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §1.
  • G. Landrum (2016) RDKit: open-source cheminformatics software. Cited by: §3, §3.
  • J. Li, D. Cai, and X. He (2017) Learning graph-level representation for drug discovery. Cited by: §3, §4.4.
  • R. Li, S. Wang, F. Zhu, and J. Huang (2018) Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2, §2.
  • C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. 23 (1), pp. 3 – 25. Note: In Vitro Models for Selection of Development Candidates External Links: ISSN 0169-409X, Document Cited by: §2.
  • A. Mayr, G. Klambauer, T. Unterthiner, M. Steijaert, J. K. Wegner, H. Ceulemans, D. Clevert, and S. Hochreiter (2018) Large-scale comparison of machine learning methods for drug target prediction on chembl. 9, pp. 5441–5451. External Links: Document Cited by: §2, §4.2.
  • S. Podlewska and R. Kafel (2018) MetStabOn—online platform for metabolic stability predictions. 19 (4), pp. 1040. Cited by: 3rd item, 4th item, §4.1.
  • B. Ramsundar, P. Eastman, P. Walters, V. Pande, K. Leswing, and Z. Wu (2019) Deep learning for the life sciences. O’Reilly Media. Cited by: Appendix B.
  • D. Rogers and M. Hahn (2010) Extended-connectivity fingerprints. 50 (5), pp. 742–754. Cited by: Appendix B, §4.2.
  • K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko (2017) Quantum-chemical insights from deep tensor neural networks. 8, pp. 13890. External Links: Document, 1609.08259 Cited by: §2.
  • M. Segler, T. Kogej, C. Tyrchan, and M. Waller (2017) Generating focused molecule libraries for drug discovery with recurrent neural networks. 4. External Links: Document Cited by: §3.
  • C. Shang, Q. Liu, K. Chen, J. Sun, J. Lu, J. Yi, and J. Bi (2018) Edge Attention-based Multi-Relational Graph Convolutional Networks. pp. arXiv:1802.04944. External Links: 1802.04944 Cited by: Appendix E, §2, §4.2, §4.4.
  • P. Shaw, J. Uszkoreit, and A. Vaswani (2018) Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 464–468. External Links: Document Cited by: §2.
  • B. Shin, S. Park, K. Kang, and J. C. Ho (2019) Self-attention based molecule representation for predicting drug-target interaction. In Proceedings of the 4th Machine Learning for Healthcare Conference, F. Doshi-Velez, J. Fackler, K. Jung, D. Kale, R. Ranganath, B. Wallace, and J. Wiens (Eds.), Proceedings of Machine Learning Research, Vol. 106, Ann Arbor, Michigan, pp. 230–248. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. abs/1706.03762. Cited by: Appendix B, §1, §2, §2, §3.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2017) Graph Attention Networks. pp. arXiv:1710.10903. External Links: 1710.10903 Cited by: §2.
  • I. Wallach, M. Dzamba, and A. Heifets (2015) AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. abs/1510.02855. Cited by: §2.
  • S. Wang, Y. Guo, Y. Wang, H. Sun, and J. Huang (2019) SMILES-bert: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB ’19, New York, NY, USA, pp. 429–436. External Links: ISBN 9781450366663, Document Cited by: §2, §4.2.
  • D. Weininger (1988) SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. 28 (1), pp. 31–36. Cited by: §2.
  • M. Withnall, E. Lindelöf, O. Engkvist, and H. Chen (2019) Building attention and edge convolution neural networks for bioactivity and physical-chemical property prediction. ChemRxiv. Cited by: §2.
  • C. H. Wong, K. W. Siah, and A. W. Lo (2018) Estimation of clinical trial success rates and related parameters. Biostatistics 20 (2), pp. 273–286. Cited by: §1.
  • Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande (2018) MoleculeNet: a benchmark for molecular machine learning. 9, pp. 513–530. External Links: Document Cited by: §1, §2, §2, §2, 1st item, 2nd item, §4.1, §4.1, §4.2, §4.5.
  • K. Yang, K. Swanson, W. Jin, C. Coley, P. Eiden, H. Gao, A. Guzman-Perez, T. Hopper, B. Kelley, M. Mathea, et al. (2019) Analyzing learned molecular representations for property prediction. 59 (8), pp. 3370–3388. Cited by: §2.

Appendix A Dataset details.

We include below a more detailed description of the datasets used in our benchmark.

  • FreeSolv, ESOL. Regression tasks. Popular tasks for predicting water solubility in terms of the hydration free energy (FreeSolv) and logS (ESOL). Solubility of molecules is an important property that influences the bioavailability of drugs.

  • Blood-brain barrier permeability (BBBP). Binary classification task. The blood-brain barrier (BBB) separates the central nervous system from the bloodstream. Predicting BBB penetration is especially relevant in drug design when the goal for the molecule is either to reach the central nervous system or the contrary – not to affect the brain.

  • MetStabhigh, MetStablow. Binary classification tasks. The metabolic stability of a compound is a measure of its half-life within an organism. The compounds for this task were taken from Podlewska and Kafel (2018), where they were divided into three sets: high, medium, and low stability. In this paper we concatenated these sets to build two classification tasks: MetStabhigh (discriminating high against others) and MetStablow (discriminating low against others).

  • Estrogen Alpha, Estrogen Beta. Binary classification tasks. Often in drug discovery, it is important that a molecule is not potent towards a given target. Modulating the estrogen receptors changes genomic expression throughout the body, which in turn may lead to the development of cancer. For these tasks, compounds with known activities towards the receptors were extracted from the ChEMBL database (Gaulton et al., 2011) and divided into active and inactive sets based on a threshold on their reported inhibition constant (Ki).

Appendix B Other experimental details

In this section we include details for hyperparameters and training settings used in Section 4.2.

Molecule Attention Transformer.

Table 7 shows hyperparameter ranges used in experiments for MAT. A short description of these hyperparameters is listed below:

  • model dim – size of embedded atom features,

  • layers number – number of encoder module repeats (N in Figure 1),

  • attention heads number – number of molecule self-attention heads,

  • PFFs number – number of dense layers in the position-wise feed-forward block (see Figure 1),

  • lambda attention – weight of the self-attention matrix,

  • lambda distance – weight of the distance matrix,

  • distance matrix kernel – function used to transform the distance matrix,

  • model dropout – dropout applied after the embedding layer, position-wise feed-forward layers, and residual layers (before the sum operation),

  • weight decay – optimizer weight decay,

  • learning rate – (see Equation 3),

  • epochs number – number of epochs for which the model is trained,

  • batch size – batch size used during training of the model,

  • warmup factor – fraction of epochs after which we stop increasing the learning rate linearly and begin decreasing it proportionally to the inverse square root of the step number (see Equation 3).

batch size 8, 16, 32, 64, 128
learning rate .01, .005, .001, .0005, .0001
epochs 30, 100
model dim 32, 64, 128, 256, 512, 1024
layers number 1, 2, 4, 6, 8
attention heads number 1, 2, 4, 8, 16
PFFs number 1
lambda attention 0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1
lambda distance 0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1
distance matrix kernel ’softmax’, ’exp’
model dropout .0, .1, .2
weight decay .0, .00001, .0001, .001, .01
warmup factor .0, .1, .2, .3, .4, .5
Table 7: Molecule Attention Transformer hyperparameter ranges

As suggested in Vaswani et al. (2017), for the optimization of MAT we used the Adam optimizer (Kingma and Ba, 2014), with the learning rate given by the following schedule:

lrate(step) = factor · min(step^(-1/2), step · warmup_steps^(-3/2))     (3)

where factor is given by the learning rate hyperparameter, and warmup_steps is given by warmup factor · total number of training steps.
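Assuming Equation 3 is the standard schedule from Vaswani et al. (2017), the scheduler can be sketched in a few lines (function and argument names are ours, not the paper's):

```python
def noam_lr(step: int, factor: float, warmup_steps: int) -> float:
    # Linear warmup for the first `warmup_steps` steps, then decay
    # proportional to the inverse square root of the step number.
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return factor * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` coincide exactly at `step == warmup_steps`, so the schedule peaks there.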

After the final layer, the embedding of the molecule is calculated by taking the mean of the vector representations of all atoms returned by the network (Global pooling in Figure 1). It is then passed to a single linear layer, which returns the prediction.
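A minimal numpy sketch of this readout step (shapes and weights below are illustrative, not the trained model's):

```python
import numpy as np

def readout(atom_embeddings: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Mean-pool the per-atom vectors into a molecule embedding,
    then apply a single linear layer to produce the prediction."""
    mol = atom_embeddings.mean(axis=0)  # (model_dim,)
    return mol @ w + b                  # (n_outputs,)

# toy example: 3 atoms, model_dim 4, single regression output
atoms = np.ones((3, 4))
w = np.full((4, 1), 0.5)
b = np.zeros(1)
pred = readout(atoms, w, b)  # mean is all-ones, so the prediction is 2.0
```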

SVM, RF, GCN, Weave.

In our experiments, we used the DeepChem (Ramsundar et al., 2019) implementations of the baseline algorithms (SVM, RF, GCN, Weave). We used the same hyperparameters for tuning as were used in DeepChem, taking into account their proposed default values (we list them in Tables 8–11).

RF and SVM work on the vector representation of a molecule given by Extended-connectivity fingerprints (Rogers and Hahn, 2010). ECFP vectors were calculated using the CircularFingerprint class from the DeepChem package, with default parameters (radius=2, size=2048).

C .25, .4375, .625, .8125, 1., 1.1875, 1.375, 1.5625, 1.75, 1.9375, 2.125, 2.3125, 2.5, 2.6875, 2.875, 3.0625, 3.25, 3.4375, 3.625, 3.8125, 4.
gamma .0125, .021875, .03125, .040625, .05, .059375, .06875, .078125, .0875, .096875, .10625, .115625, .125, .134375, .14375, .153125, .1625, .171875, .18125, .190625, .2
Table 8: SVM hyperparameter ranges

n estimators
125, 218, 312, 406, 500, 593, 687, 781, 875, 968, 1062, 1156, 1250, 1343, 1437, 1531, 1625, 1718, 1812, 1906, 2000
Table 9: RF hyperparameter ranges

batch size 64, 128, 256
learning rate 0.002, 0.001, 0.0005
n filters 64, 128, 192, 256
n fully connected nodes 128, 256, 512
Table 10: GCN hyperparameter ranges

batch size 16, 32, 64, 128
nb epoch 20, 40, 60, 80, 100
learning rate 0.002, 0.001, 0.00075, 0.0005
n graph feat 32, 64, 96, 128, 256
n pair feat 14
Table 11: Weave hyperparameter ranges


Table 12 shows hyperparameter ranges used in experiments for EAGCN, covering both the 'concate' and 'weighted' structures. For EAGCN with the 'weighted' structure, a different range of the number of convolutional features was used.

batch size 16, 32, 64, 128, 256, 512
EAGCN structure ’concate’, ’weighted’
num epochs 30, 100
learning rate .01, .005, .001, .0005, .0001
dropout .0, .1, .3
weight decay .0, .001, .01, .0001
n conv layers 1, 2, 4, 6
n dense layers 1, 2, 3, 4
n sgc 1 30, 60
n sgc 2 5, 10, 15, 20, 30
n sgc 3 5, 10, 15, 20, 30
n sgc 4 5, 10, 15, 20, 30
n sgc 5 5, 10, 15, 20, 30
dense dim 16, 32, 64, 128
Table 12: EAGCN hyperparameter ranges

Appendix C Additional results for Sec. 4.2

Predicting internal energy

We run an additional experiment on a regression task related to quantum mechanics. From the Alchemy dataset (Chen et al., 2019), which is a dataset of 12 quantum properties calculated for 200K molecules, we have chosen internal energy at 298.15 K to further test the performance of our model. We hypothesize that our molecule self-attention should perform particularly well in tasks involving atom level interactions such as energy prediction.

Table 13 presents mean absolute errors of three methods: one classical method (RF), one graph method (GCN), and our pretrained MAT. We use the original train/valid/test splits of the dataset. For RF and GCN we run a random search with a budget of 500 hyperparameter sets. For pretrained MAT, we tune only the learning rate.

MAT achieves a slightly lower error than GCN. As can be expected, both graph methods can learn the internal energy function correctly because of the locality preserved in the graph structure. The classical method based on fingerprints gives an MAE that is almost two orders of magnitude higher than the MAE of the other methods in the comparison.

U (internal energy)
RF .380
GCN .006
MAT .004
Table 13: Test results for internal energy prediction reported as MAE. All methods were tuned with a random search with a budget of 500 hyperparameter combinations.

Training error for graph-based neural networks

Ishiguro et al. (2019) show that graph neural networks suffer from underfitting of the training set and their performance does not scale well with the complexity of the network. We reproduce their experiments and confirm that this problem is indeed present for both GCN and MAT. According to Figure 7, the training loss of GCN and MAT flattens at some point and stops decreasing even if we keep increasing the number of layers and model dimensionality. Despite this issue, for almost all settings, MAT achieves lower training error than GCN.

Figure 7: Training loss of MAT and GCN as a function of the number of layers (left) and model dimensionality (right).

Appendix D Additional details for Sec. 4.3

Task description.

As the node-level pretraining task we chose masking from Hu et al. (2019), which is a version of the BERT masked language model adapted to graph-structured data. The idea is that predicting masked nodes based on their neighbourhood will encourage the model to capture domain-specific relationships between atoms.

For each molecular graph we randomly replace 15% of input nodes (atom attributes) with a special mask token. After the forward pass we apply a linear model to the corresponding node embeddings to predict the masked node attributes. In the case of EAGCN we additionally mask the attributes of edges connected to masked nodes, to prevent the model from learning simple value copying.
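The masking step can be sketched as follows (the mask token id and the attribute encoding are placeholders, not the paper's exact choices):

```python
import random

MASK_TOKEN = -1  # placeholder id for the special mask token

def mask_nodes(node_ids, mask_ratio=0.15, rng=None):
    """Replace ~15% of node attribute ids with the mask token.
    Returns the corrupted ids and the indices the model must predict."""
    rng = rng or random.Random(0)
    n = len(node_ids)
    k = max(1, int(round(n * mask_ratio)))
    masked_idx = rng.sample(range(n), k)
    corrupted = list(node_ids)
    for i in masked_idx:
        corrupted[i] = MASK_TOKEN
    return corrupted, masked_idx
```

During pretraining, the loss is computed only on the embeddings at `masked_idx`.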

Pretraining setting.

The training dataset consisted of 2 million molecules sampled from the ZINC15 database. Models were trained for 8 epochs with the learning rate set to 0.001 and batch size 256. MAT was optimized with the Noam optimizer (described in App. B), whereas for EAGCN we used Adam (Kingma and Ba, 2014). In both cases the procedure minimized the binary cross-entropy loss.

Fine-tuning setting.

All our pretrained models are fine-tuned on the target tasks with a fixed number of epochs and batch size, and with the learning rate selected from a small set of candidate values.

In the Estrogen Alpha experiments we excluded three molecules (with the highest numbers of atoms) from the dataset, due to memory issues.

model dim 1024
layers number 8
attention heads number 16
PFFs number 1
distance matrix kernel ’exp’
model dropout .0
weight decay .0
Table 14: Pretrained MAT hyperparameters

EAGCN structure ’weighted’
dropout .0
weight decay .0
n conv layers 8
n dense layers 1
n sgc 1080
Table 15: Pretrained EAGCN hyperparameters

SMILES Transformer.

We used the pretrained weights of SMILES-Transformer published by Honda et al. (2019). In this setting, following the authors, we used an MLP with a hidden layer on top of the molecule embedding returned by the pretrained transformer. We trained this MLP on the target tasks, with the learning rate selected from a small set of candidate values.

Appendix E Additional results for Sec. 4.4

Edge features.

Every bond in the molecule was embedded by a vector of edge features (we used features similar to those described in Shang et al. (2018)). Every edge feature vector was then passed through a linear layer, followed by a ReLU activation, which returned a single value for every edge (if there is no edge between two atoms, we pass a zero vector through the layer). The resulting matrix was then used in the Molecule Self-Attention layer instead of the adjacency matrix.
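A sketch of this construction (feature dimensions and weights below are illustrative):

```python
import numpy as np

def edge_matrix(edge_feats: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Map each edge feature vector to a single non-negative value
    (linear layer + ReLU). Non-edges carry zero vectors, so with b <= 0
    they map to zero. The result replaces the adjacency matrix.

    edge_feats: (n_atoms, n_atoms, n_edge_features)
    w:          (n_edge_features,)
    """
    scores = edge_feats @ w + b    # (n_atoms, n_atoms)
    return np.maximum(scores, 0.0)  # ReLU
```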

Attribute Description
Bond Order Values from set { 1, 1.5, 2, 3 }
Aromaticity Is aromatic
Conjugation Is conjugated
Ring Status Is in a ring
Table 16: Edge features used for the experiments from Table 6

Appendix F Toy task

Task description.

The essential feature of Molecule Attention Transformer is that it augments the self-attention module using molecule structure. Here we investigate MAT on a task heavily reliant on distances between atoms; we are primarily interested in how the performance of MAT depends on the lambda weights that scale the adjacency and the distance matrices in Equation 2.
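To make the role of these weights concrete, here is a sketch of our reading of the weighting in Equation 2 (the softmax term stands in for the standard scaled dot-product attention scores; the weight names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def molecule_attention(scores, dist_kernel, adjacency,
                       lam_attention, lam_distance, lam_adjacency):
    """Weighted mix of the self-attention matrix, a distance-derived
    matrix, and the adjacency matrix (cf. Equation 2)."""
    return (lam_attention * softmax(scores)
            + lam_distance * dist_kernel
            + lam_adjacency * adjacency)
```

Setting a high distance weight and low values for the other two makes the module attend mostly according to inter-atomic distances.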

Naturally, many properties of molecules depend on their geometry. For instance, a steric effect happens when the spatial proximity of a given group blocks a reaction from happening, due to an overlap of electronic groups. However, this type of reasoning can be difficult to learn based only on the graph information, as it does not always reflect the geometry well. Furthermore, focusing on distance information might require selecting low values for the other lambda weights (see Figure 1).

To illustrate this, we designed a toy task to predict whether or not two substructures are closer to each other in space than a predefined threshold; see also Figure 8. We expect that MAT will work significantly better than a vanilla graph convolutional network if the distance weight is tuned well.

Experimental setting.

We construct the dataset by sampling 2677 molecules from PubChem (Kim et al., 2018), and use a 20 Å threshold between the -NH fragment and the tert-butyl group to determine the binary label. The threshold was selected so that positive and negative examples are well balanced.

Figure 8: The toy task is to predict whether two substructures (-NH fragment and tert-butyl group) co-occur within given distance.
Figure 10: MAT can efficiently use the inter-atomic distances to solve the toy task (see left). Additionally, the performance is heavily dependent on , which motivates tuning in the main experiments (see right).
Figure 9: Molecule Attention Transformer performance on the toy task as a function of , for different settings of and .


First, we plot Molecule Attention Transformer performance as a function of the distance weight in Figure 9, for three settings of the remaining weights (blue, orange, and green). In all cases we find that using distance information improves the performance significantly. Additionally, we found that GCN achieves a lower AUC on this task than MAT with a well-tuned distance weight. These results both motivate tuning the lambda weights, and show that MAT can efficiently use distance information if it is important for the task at hand.

Further details.

The molecules in the toy task dataset were downloaded from PubChem. The SMARTS query used to find the compounds was (C([C;H3])([C;H3])([C;H3]).[NX3H2]). All molecules were then filtered so that only those with exactly one tert-butyl group and one -NH fragment were left. For each of them, five conformers were created with RDKit implementation of the Universal Force Field (UFF).

The task is a binary classification of the distance between two molecule fragments. If the Euclidean distance between the -NH fragment and the tert-butyl group is greater than a given threshold, the label is 1 (0 otherwise). By this distance we mean the distance between the closest heavy atoms of the two fragments across the calculated conformers. We used 20 Å as the threshold, as it leads to a balanced dataset. There are 2677 compounds in total, of which 1140 are in the positive class. The dataset was randomly split into training, validation, and test sets.
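The labeling rule can be sketched as follows (fragment atom indices and conformer coordinates below are illustrative):

```python
import math

def fragment_distance(coords, frag_a, frag_b):
    """Minimum Euclidean distance between heavy atoms of two fragments
    within one conformer; coords maps atom index to a 3D point."""
    return min(math.dist(coords[i], coords[j]) for i in frag_a for j in frag_b)

def toy_label(conformers, frag_a, frag_b, threshold=20.0):
    """Label 1 if the closest approach of the two fragments, taken
    across all conformers, exceeds the threshold (in angstroms)."""
    d = min(fragment_distance(c, frag_a, frag_b) for c in conformers)
    return int(d > threshold)
```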

In the experiments, hyperparameters that yielded promising results on our datasets were used (listed in Table 17). The values of the lambda parameters were tuned, and their scores are shown in Figure 9. All three lambda parameters (the attention, distance, and adjacency weights) sum to 1 in all experiments.

To compare our results with a standard graph convolutional neural network, we ran a grid search over the hyperparameters shown in Table 18. The hyperparameters for which the best validation AUC score was reached are emboldened.

batch size 16
learning rate 0.0005
epochs 100
model dim 64
model N 4
model h 8
model N dense 2
model dense output nonlinearity ’tanh’
distance matrix kernel ’softmax’
model dropout 0.0
weight decay 0.001
optimizer ’adam_anneal’
aggregation type ’mean’
Table 17: MAT hyperparameters used.

batch size 16, 32, 64
learning rate 0.0005
epochs 20, 40, 60, 80, 100
n filters 64, 128
n fully connected nodes 128, 256
Table 18: Hyperparameters used for tuning GCN.

Appendix G Interpretability analysis

Head i ii iii iv v vi
SMARTS [c;D2] [S,s] [N;R0] O=* [a;D3] n
mean (matching) .136 .330 .061 .095 .043 .228
std (matching) .080 .280 .074 .120 .032 .171
mean (other) .008 .001 .002 .006 .006 .005
std (other) .032 .003 .016 .034 .014 .009
Table 19: Statistics of the six attention head patterns described in the text. Each head function is defined by a SMARTS pattern that selects atoms with high attention weights. For each atom in the dataset we calculated the mean weight assigned to it by the corresponding attention head (average column value of the attention matrix). The calculated means and standard deviations show the difference between the attention weights of matching atoms and the other atoms.

We found several patterns in the self-attention heads by looking at the first layer of MAT. These patterns correspond to chemical structures that can be found in molecules. For each pattern found in this qualitative manner, we then tested quantitatively whether our hypothesis about what the attention head represents holds.

For each pattern found in one of the attention heads, we construct a SMARTS expression describing atoms that belong to the hypothesized molecular structure. Then, all atoms matching the pattern are extracted from the BBBP dataset, and their mean attention weights (average column value of the attention matrix) are compared against atoms that do not match the pattern. Table 19 shows the distributions of attention weights for matching and non-matching atoms. Atoms which match the SMARTS expression have significantly higher attention weights (Kruskal-Wallis test).
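The quantity compared in Table 19 (the average column value of the attention matrix, split by SMARTS match) can be computed as in this sketch (the SMARTS matching itself, done with RDKit, is omitted; the match mask is illustrative):

```python
import numpy as np

def mean_attention_per_atom(attention: np.ndarray) -> np.ndarray:
    """Average attention each atom receives: the column mean of the
    (n_atoms x n_atoms) attention matrix for one head."""
    return attention.mean(axis=0)

def split_by_match(weights: np.ndarray, match_mask: np.ndarray):
    """Split per-atom weights into matching vs. non-matching groups,
    whose distributions are then compared (e.g. Kruskal-Wallis)."""
    return weights[match_mask], weights[~match_mask]
```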