GEOM: Energy-annotated molecular conformations for property prediction and molecular generation

by Simon Axelrod, et al.
Harvard University

Machine learning outperforms traditional approaches in many molecular design tasks. Although molecules are often thought of as 2D graphs, they in fact consist of an ensemble of inter-converting 3D structures called conformers. Molecular properties arise from the contribution of many conformers, and in the case of a drug binding a target, may be due mainly to a few distinct members. Molecular representations in machine learning are typically based on either one single 3D conformer or on a 2D graph that strips geometrical information. No reference datasets exist that connect these graph and point cloud ensemble representations. Here, we use first-principles simulations to annotate over 400,000 molecules with the ensemble of geometries they span. The Geometrical Embedding Of Molecules (GEOM) dataset contains over 33 million molecular conformers labeled with their relative energies and statistical probabilities at room temperature. This dataset will assist benchmarking and transfer learning in two classes of tasks: inferring 3D properties from 2D molecular graphs, and developing generative models to sample 3D conformations.




1 Introduction

Machine learning outperforms traditional rule-based baselines in many molecule-related tasks, including property prediction and virtual screening Stokes et al. (2020); Gómez-Bombarelli et al. (2016); Zhavoronkov et al. (2019), inverse design using generative models Schwalbe-Koda and Gómez-Bombarelli (2019); Gómez-Bombarelli et al. (2018); Jin et al. (2018); De Cao and Kipf (2018); Li et al. (2018); Dai et al. (2018); Wang et al. (2020); Noé et al. (2019), reinforcement learning Olivecrona et al. (2017); Gottipati et al. (2020); Guimaraes et al. (2017); Popova et al. (2018), differentiable simulators Wang et al. (2020); AlQuraishi (2019); Ingraham et al. (2019), and synthesis planning and retrosynthesis Segler et al. (2018); Coley et al. (2017). These applications have been enabled by reference datasets and tasks Ramsundar et al. (2019), and by algorithmic improvements, especially in representation learning. In particular, graph convolutional Duvenaud et al. (2015); Kearnes et al. (2016); Yang et al. (2019) and, more recently, equivariant neural network architectures Anderson et al. (2019); Thomas et al. (2018); Klicpera et al. (2020) achieve state-of-the-art performance in a variety of tasks.

1.1 Molecular representations

Unlike other data structures, molecules do not have an obvious basic representation. Strictly, they exist as ensembles of 3D point clouds McNaught and Wilkinson (2009). In chemistry, they are typically represented as graphs with domain-specific annotations to describe spatial arrangement. Molecular representations in machine learning, and the existing reference datasets, typically use either graphs Wu et al. (2019), or a single point cloud per molecule Schütt et al. (2017). Because of the non-overlapping datasets and tasks, the interplay between graph and 3D features remains unexplored from a representation learning perspective.

Molecular representations geared for processing by humans are well-studied McNaught and Wilkinson (2009). A molecule is a stable spatial arrangement of atoms. This arrangement fluctuates in time, as it exists on an energy surface with many local minima. It is generally possible to identify chemical bonds that connect pairs of nearby atoms in a molecule. These bonds can be sorted into qualitative classes (single, double, etc.). Molecules are represented in chemistry through 2D projections of the atom point cloud onto a plane, with bonds symbolized as lines and the atoms at the nodes represented by their atomic symbols. The stereochemical formula utilizes perspective and node and edge notation to capture 3D orientation beyond a simple undirected graph, and can also be transformed into a string representation like SMILES Weininger (1988) or InChI Heller et al. (2015).

Multiple 3D structures can have the same connectivity but different spatial arrangements. The stable spatial structures that can inter-convert at room temperature are called conformers. The ensemble of conformers a molecule can access, and their relative populations, are dictated by their relative energies. Conformers are typically not represented explicitly, and the projection formula is understood to embed all possible conformers. It is possible to annotate a molecular entity with one or more valid geometries through physics-based simulations. Finding the conformer with the lowest possible energy, or enumerating all thermally accessible ones, is computationally challenging Fraenkel (1993). Indeed, utilizing generative and autoregressive models to guess valid, high-likelihood conformations from a molecular connectivity is an exciting and active area of development Hoffmann et al. (2019); Simm et al. (2020); Stieffenhofer et al. (2020); Hoffmann and Noé (2019); Imrie et al. (2020); Mansimov et al. (2019); Chan et al. (2019); Gebauer et al. (2019, 2018); Wang and Gómez-Bombarelli (2019).

Thus, two classes of learning tasks and two corresponding representations are generally of interest. Surrogate modeling tasks aim to replace physics simulators, and relate molecular entities existing in one specific point cloud state (conformer) to an expensive simulation outcome. These are natively 3D point cloud tasks. For predicting experimental properties of chemical species, only the stereochemical projection formula is available as input. Graph-based neural networks, however, cannot process stereochemical features natively and flatten the representation into a plain graph, and thus struggle to differentiate between stereoisomers. Since the low-energy conformers will be most abundant in the ensemble, it is common to annotate chemical species data with one arbitrary conformer to add stereochemical data back. These point cloud annotations of the graph are obtained from cheap simulations with poor guarantees of finding representative minima. Furthermore, this addition does not result in significant performance gains Maziarka et al. (2020). This may be because the predicted property emerges from the ensemble of conformers, and not from one member.

A number of open questions thus arise in relating these disparate tasks and representations, such as (i) the capacity of graph-based architectures to infer the ensemble of 3D conformers from which properties arise, (ii) novel methods to embed the stereochemical formula without relying on explicitly sampling 3D conformations, (iii) the ability of generative models to replace expensive physics-based generation of conformations, and (iv) whether exhaustive annotation with not only one but a complete conformer ensemble can enhance property prediction.

Because a reference dataset is needed to probe such questions, we report the GEOM (Geometrical Embedding Of Molecules) dataset of conformer ensembles annotated with their relative energies and populations. The dataset covers a broad chemical space of combinatorially generated small molecules and drug-like molecules. We propose synthetic regression tasks on properties of the conformational ensemble spanned by each stereochemical formula, and report the performance of existing and modified graph- and 3D-based baselines. In addition to benchmarks for regression models, GEOM provides data for pre-training on 3D-related tasks and for training generative models for molecular structure.

1.2 Related approaches

A number of reference datasets exist for surrogate simulators and for chemical species property prediction. Simulated data on molecular entities include QM9 Ramakrishnan et al. (2014), with density functional theory (DFT) calculations of molecular properties for one low-energy equilibrium conformer of molecules with up to 9 heavy atoms, and the ANI suite of datasets that sample non-equilibrium conformations for QM9 and larger molecules, some with higher levels of theory Smith et al. (2017, 2017, 2018). MD-17 has thoroughly sampled conformations for a small number of molecules with ab initio wavefunction methods and long molecular dynamics simulations [2].

Datasets for predicting macroscale experimental properties of molecules include quantitative (ESOL Delaney (2004), FreeSolv Mobley and Guthrie (2014), PDBbind-F Wang et al. (2004)) and categorical (BACE, HIV, ClinTox, Tox21, SIDER, BBBP) tasks. MoleculeNet Wu et al. (2018) and DeepChem Ramsundar et al. (2019) collect and host many of these. Unlabeled molecular sets for generation tasks, such as ChEMBL Mendez et al. (2018), ZINC Sterling and Irwin (2015) and subsets thereof Gómez-Bombarelli et al. (2018), and related benchmarks have also been reported Brown et al. (2019); Polykovskiy et al. (2018).

2 Dataset construction

2.1 Overview

The dataset is available online at [4]. A tutorial for loading the data can be found at [27]. We used the CREST software Grimme (2019) to generate conformers for 292,035 drug-like molecules and 133,318 molecules from the QM9 dataset. The drug-like molecules were accessed as part of AICures [37], an open machine learning challenge to predict which drugs can be repurposed to treat COVID-19 and related illnesses. In particular, we generated conformers for 278,622 drugs that have been tested for in vitro inhibition of SARS-CoV 3CL Tokars and Mesecar (data accessed from [1]; 411 hits), 5,755 drugs from the Broad Repurposing Hub [19] (data accessed from [3]; SARS-CoV 3CL activity treated here as unknown), and 218,632 molecules tested for in vitro inhibition of SARS-CoV PL protease Engel; Engel (660 hits, with 98% of the molecules also contained in Tokars and Mesecar). Finally, the dataset contains 2,062 molecules that have been screened for growth inhibition of Pseudomonas aeruginosa (data accessed by request from [5]; 23 hits) and 1,580 molecules screened for E. coli inhibition Stokes et al. (2020); Zampieri et al. (2017) (57 hits). Both of these pathogens can cause secondary infections in COVID-19 patients.

Statistics of molecular descriptors for the dataset are given in Table 1. The dataset of drug-like molecules consists of medium-sized organic compounds, containing an average of 44.2 atoms (24.8 heavy atoms), up to a maximum of 181 atoms (91 heavy atoms). They vary widely in flexibility, as demonstrated by the mean (6.5) and maximum (39) number of rotatable bonds. 15% (43,509) of the molecules have specified stereochemistry, while 26% (75,612) have specified or unspecified stereochemistry. The QM9 dataset is limited to 9 heavy atoms (29 total atoms), with a much smaller molecular mass and few rotatable bonds. 72% (95,734) of the species have specified stereochemistry.

2.2 Conformer generation

Generation of conformers ranked by energy is computationally complex. The exhaustive method is to enumerate all the possible rotations around every bond, but this approach scales exponentially Schwab (2010); O’Boyle et al. (2011). Basic algorithms are available in cheminformatics packages such as RDKit [60], but suffer from two flaws. First, they explore conformational space very sparsely through a combination of pre-defined distances and stochastic samples Spellmeyer et al. (1997) and can miss many low-energy conformations. Second, conformer energies are determined with classical force fields, which are rather inaccurate Kanal et al. (2018). By contrast, molecular dynamics simulations, in particular meta-dynamics approaches, can sample conformational space more exhaustively but need to evaluate an energy function many times. Likewise, ab initio methods, such as DFT, can accurately assign energies to conformers but are also orders of magnitude more computationally demanding than force fields.
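The exponential scaling of exhaustive enumeration can be made concrete with a small sketch. It assumes a hypothetical grid of three torsion angles per rotatable bond (the angle values and function names are illustrative and do not come from the paper) and counts how many candidate geometries such a grid produces.

```python
import itertools

# Hypothetical torsion grid: three staggered angles per rotatable bond.
ANGLES = (60.0, 180.0, 300.0)

def torsion_grid(n_rotatable_bonds, angles=ANGLES):
    """Lazily yield every combination of one torsion angle per rotatable bond."""
    return itertools.product(angles, repeat=n_rotatable_bonds)

def grid_size(n_rotatable_bonds, n_angles=len(ANGLES)):
    """Number of candidate geometries: n_angles ** n_rotatable_bonds."""
    return n_angles ** n_rotatable_bonds

two_bond_grid = list(torsion_grid(2))  # small enough to enumerate explicitly
typical = grid_size(6)                 # a molecule with 6 rotatable bonds
extreme = grid_size(39)                # 39 rotatable bonds, the maximum in Table 1
```

Under this assumed grid, a molecule with 6 rotatable bonds already yields 729 candidates, and 39 rotatable bonds yields 3^39, which is why exhaustive enumeration is impractical for flexible drug-like molecules.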

An efficient balance is offered by the newly developed CREST software Grimme (2019). This program uses semi-empirical tight-binding density functional theory (GFN2-xTB) for energy calculation. The predicted energies are significantly more accurate than classical force fields, accounting for electronic effects, rare functional groups, and the breaking and formation of labile bonds, but are computationally less demanding than full DFT. Moreover, the search algorithm is based on meta-dynamics, a well-established thermodynamic sampling approach that can efficiently explore the low-energy search space. Finally, the CREST software identifies and groups rotamers, conformers that are identical except for atom re-indexing. It then assigns each conformer a probability through

$$p_i = \frac{d_i \exp(-E_i / k_B T)}{\sum_j d_j \exp(-E_j / k_B T)}.$$

Here $p_i$ is the statistical weight of the $i^{\mathrm{th}}$ conformer, $d_i$ is its degeneracy (i.e., how many chemically and permutationally equivalent rotamers correspond to the same conformer), $E_i$ is its energy, $k_B$ is the Boltzmann constant, $T$ is the temperature, and the sum is over all conformers.
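As a minimal sketch of the degeneracy-weighted Boltzmann probabilities described above, the following function normalizes $d_i \exp(-E_i / k_B T)$ over all conformers. It assumes energies in kcal/mol (so the molar gas constant stands in for $k_B$) and a temperature of 298.15 K; the function name and the example values are hypothetical.

```python
import math

# Gas constant in kcal/(mol K); with molar energies it plays the role of k_B.
R_KCAL = 1.987204e-3

def boltzmann_weights(energies, degeneracies, temperature=298.15):
    """Return p_i = d_i exp(-E_i / RT) / sum_j d_j exp(-E_j / RT)."""
    e_min = min(energies)  # shifting by the minimum cancels in the ratio
    beta = 1.0 / (R_KCAL * temperature)
    unnormalized = [
        d * math.exp(-(e - e_min) * beta)
        for e, d in zip(energies, degeneracies)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical relative energies (kcal/mol) and rotamer degeneracies.
energies = [0.0, 0.5, 1.2]
degeneracies = [1, 2, 1]
weights = boltzmann_weights(energies, degeneracies)
```

Subtracting the minimum energy before exponentiating does not change the probabilities, but it avoids underflow when absolute energies are large.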

CREST runs on the drug dataset took an average of 2.8 hours of wall time on 32 cores on Knights Landing (KNL) nodes (89.1 core hours), and 0.63 hours on 13 cores on Cascade Lake and Sky Lake nodes (8.2 core hours). QM9 jobs were only performed on the latter two node types, and took an average of 0.04 wall hours on 13 cores (0.5 core hours). In total, 13 million KNL core hours and 1.2 million Cascade Lake/Sky Lake core hours were used.

| Descriptor | Drug mean | Drug standard deviation | Drug maximum | QM9 mean | QM9 standard deviation | QM9 maximum |
| --- | --- | --- | --- | --- | --- | --- |
| Number of atoms | 44.2 | 11.0 | 181 | 18.0 | 3.0 | 29 |
| Number of heavy atoms | 24.8 | 5.6 | 91 | 8.8 | 0.51 | 9 |
| Molecular mass (amu) | 355.0 | 78.6 | 1549.7 | 122.7 | 7.6 | 152.0 |
| Number of rotatable bonds | 6.5 | 2.9 | 39 | 2.1 | 1.6 | 8 |
| Stereochemistry (specified), number of molecules | 43,509 | - | - | 95,734 | - | - |
| Stereochemistry (all), number of molecules | 75,612 | - | - | 95,734 | - | - |

Table 1: Molecular descriptor statistics for the drug-like molecules (N=292,035) and the QM9 molecules (N=133,318) in the GEOM dataset.
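
| Dataset | Quantity | Mean | Standard deviation | Maximum |
| --- | --- | --- | --- | --- |
| Drug | $S$ (cal/mol K) | 8.2 | 2.6 | 16.8 |
| Drug | $-G$ (kcal/mol) | 2.5 | 0.8 | 5.0 |
| Drug | $\langle E \rangle$ (kcal/mol) | 0.4 | 0.2 | 2.4 |
| Drug | Conformers | 107.3 | 169.0 | 7461 |
| QM9 | $S$ (cal/mol K) | 3.9 | 2.8 | 14.2 |
| QM9 | $-G$ (kcal/mol) | 1.1 | 0.8 | 4.3 |
| QM9 | $\langle E \rangle$ (kcal/mol) | 0.2 | 0.2 | 2.2 |
| QM9 | Conformers | 13.4 | 44.0 | 1614 |

Table 2: CREST-based statistics and violin plots for the drug and QM9 datasets.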

3 Conformational property prediction

The GEOM dataset is significant for three key reasons. The first is that it provides high-quality 3D structures, energies and probabilities for a large number of drug-like molecules. These expensive annotations may result in increased performance in property prediction tasks. If one is interested in drug repurposing, rather than generation of entirely new molecules, the search space of existing drugs is already annotated and no new conformers are needed. Second, the dataset can be used for training generative models to predict conformations. These models can be used to generate conformations of unseen molecules to bypass ab initio simulations. Third and most important, the dataset provides summary statistics for each molecule that are related to conformational degrees of freedom (conformational entropy, Gibbs free energy, average energy, and number of unique conformers). All of these are aggregate properties that represent the 3D ensemble, but emerge from the molecular graph in a known way. Hence, this dataset allows representation learning strategies to be tested across the progression graph → single conformer → conformer ensemble on tasks that are ultimately 3D, but fully emergent. This is applicable both as a benchmark task for new architectures and as a pre-training strategy to be transferred to low-data 3D-driven tasks like drug-target binding.

Here we focus on this third application. We compare different neural network architectures and their ability to predict summary statistics. Moreover, we ask whether limited 3D information, such as that of only the highest-probability conformer, can improve predictive performance.

3.1 Neural network architectures

Here we discuss various 2D and 3D message-passing neural network architectures Gilmer et al. (2017) used to predict molecular properties.

3.1.1 Message passing

A molecule can be thought of as a graph, consisting of a set of nodes (atoms) connected to each other by a set of edges (bonds). Both the nodes and edges have features. The atoms, for example, can be characterized by their atomic number and partial charge. The bonds can be characterized by bond order. Message-passing neural networks use these node and edge features to create a learned fingerprint (representation) for the molecule. This is called the message passing phase. The fingerprint is used as input to a function that predicts a property. This stage is called the readout phase Yang et al. (2019).

The message passing phase consists of $T$ steps, or convolutions. In what follows, superscripts denote the convolution number. The node features of the $v^{\mathrm{th}}$ node are $x_v$, and the edge features between nodes $v$ and $w$ are $e_{vw}$. The atom features $x_v$ are initially mapped to another set of vectors $h_v^0$, termed hidden states. In the $t^{\mathrm{th}}$ convolution, a message $m_v^{t+1}$ is created, which combines $h_v^t$ and $h_w^t$ for each pair of nodes $v$ and $w$ with edge features $e_{vw}$ Gilmer et al. (2017); Yang et al. (2019):

$$m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw}),$$

where $N(v)$ is the set of neighbors of $v$ in graph $G$, and $M_t$ is a message function. The hidden states are updated using a vertex update function $U_t$:

$$h_v^{t+1} = U_t(h_v^t, m_v^{t+1}).$$

The readout phase then uses a function $R$ to map the final hidden states to a property $y$, through

$$y = R(\{h_v^T \mid v \in G\}).$$
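The message, update, and readout phases described above can be sketched in a few lines. This is an illustrative skeleton only: the message, update, and readout functions are passed in as plain Python callables, and the toy versions used here are hypothetical stand-ins for learned networks.

```python
def message_passing(h0, neighbors, edge_feats, message_fn, update_fn,
                    readout_fn, num_steps):
    """Run num_steps convolutions, then apply the readout to the final states."""
    h = {v: list(state) for v, state in h0.items()}
    for _ in range(num_steps):
        messages = {}
        for v in h:
            # Message for node v: sum of message_fn over the neighbors w of v.
            total = [0.0] * len(h[v])
            for w in neighbors[v]:
                msg = message_fn(h[v], h[w], edge_feats[(v, w)])
                total = [a + b for a, b in zip(total, msg)]
            messages[v] = total
        # Update every hidden state from its own state and its message.
        h = {v: update_fn(h[v], messages[v]) for v in h}
    return readout_fn(list(h.values()))

# Toy stand-ins for the learned functions (hypothetical, not trained networks).
def toy_message(h_v, h_w, e_vw):
    return [e_vw * x for x in h_w]

def toy_update(h_v, m_v):
    return [a + b for a, b in zip(h_v, m_v)]

def toy_readout(states):
    return sum(sum(state) for state in states)

# A three-atom chain 0-1-2 with unit edge features and unit hidden states.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
edge_feats = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}
h0 = {0: [1.0], 1: [1.0], 2: [1.0]}
y = message_passing(h0, neighbors, edge_feats,
                    toy_message, toy_update, toy_readout, num_steps=2)
```

All messages in a step are computed from the hidden states of the previous step before any state is updated, which matches the synchronous update in the equations.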


3.1.2 Learning over 2D graphs

For 2D graphs we adopt the directed message-passing approach of Ref. Dai et al. (2016) with the implementation used in Ref. Yang et al. (2019), the latter of which is called ChemProp. The detailed analysis of Ref. Yang et al. (2019) showed that ChemProp achieves state-of-the-art performance on a wide range of regression and classification tasks. The ChemProp code was accessed through [11]. In this implementation, hidden states $h_{vw}^t$ and messages $m_{vw}^t$ are used, rather than node-based states $h_v^t$ and messages $m_v^t$. Here the direction matters, as in general $h_{vw}^t \neq h_{wv}^t$ and $m_{vw}^t \neq m_{wv}^t$. This implementation helps to avoid messages that loop back to the original node Mahé et al. (2004); Yang et al. (2019).

Hidden states are initialized with

$$h_{vw}^0 = \tau(W_i \, \mathrm{cat}(x_v, e_{vw})),$$

where $W_i$ is a learned matrix, $\mathrm{cat}(x_v, e_{vw})$ is the concatenation of the atom features $x_v$ for atom $v$ and the bond features $e_{vw}$ for bond $vw$, and $\tau$ is the ReLU activation function Nair and Hinton (2010). The message passing function is simply $M_t(x_v, x_w, h_{vw}^t) = h_{vw}^t$. The edge update function is the same neural network at each step:

$$U_t(h_{vw}^t, m_{vw}^{t+1}) = U(h_{vw}^t, m_{vw}^{t+1}) = \tau(h_{vw}^0 + W_m m_{vw}^{t+1}),$$

where $W_m \in \mathbb{R}^{h \times h}$ is a learned matrix with hidden size $h$. Each message-passing phase is then

$$m_{vw}^{t+1} = \sum_{k \in N(v) \setminus w} h_{kv}^t, \qquad h_{vw}^{t+1} = \tau(h_{vw}^0 + W_m m_{vw}^{t+1}),$$

for $t \in \{1, \dots, T\}$. After the final convolution, the atom representation of the molecule is recovered through

$$m_v = \sum_{w \in N(v)} h_{wv}^T, \qquad h_v = \tau(W_a \, \mathrm{cat}(x_v, m_v)).$$

The hidden states are then summed to give a feature vector for the molecule: $h = \sum_{v \in G} h_v$. Properties are predicted through $y = f(h)$, where $f$ is a feed-forward neural network. In ChemProp the atom features are atom type, number of bonds, formal charge, chirality, number of bonded hydrogen atoms, hybridization, aromaticity, and atomic mass. The bond features are the bond type (single, double, triple, or aromatic), whether the bond is conjugated, whether it is part of a ring, and whether it contains stereochemistry (none, any, E/Z or cis/trans). All features are one-hot encodings. Non-learnable features are incorporated through concatenation with $h$ before applying the readout network. Details of architecture hyperparameters can be found in the SM.
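The key feature of the directed scheme, that the message along an edge leaves out the reverse edge, can be illustrated with a deliberately simplified sketch. It assumes scalar hidden states (hidden size 1) and a scalar weight in place of the learned matrix, and it stops at the per-atom sum of incoming edge states. This is not the ChemProp code, and all names are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def directed_message_passing(edges, h0, w_m, num_steps):
    """edges: directed (v, w) pairs; h0: dict mapping each edge to a scalar state."""
    h = dict(h0)
    for _ in range(num_steps):
        new_h = {}
        for (v, w) in edges:
            # Message for edge v -> w: sum the states of edges k -> v,
            # skipping the reverse edge w -> v so nothing loops straight back.
            m_vw = sum(h[(k, u)] for (k, u) in edges if u == v and k != w)
            new_h[(v, w)] = relu(h0[(v, w)] + w_m * m_vw)
        h = new_h
    return h

def atom_messages(edges, h, atoms):
    """Per-atom message: sum of the final states of all edges pointing into the atom."""
    return {v: sum(h[(w, u)] for (w, u) in edges if u == v) for v in atoms}

# Three-atom chain 0-1-2, each bond represented by two directed edges.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
h0 = {edge: 1.0 for edge in edges}
h_final = directed_message_passing(edges, h0, w_m=0.5, num_steps=1)
m_atoms = atom_messages(edges, h_final, atoms=[0, 1, 2])
```

In the chain example, the edge from the terminal atom 0 to atom 1 receives no message, because the only edge pointing into atom 0 is the reverse edge, which is excluded.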

3.1.3 Learning with 3D features

A variety of graph convolutional models have been proposed for learning force fields, which map a set of 3D atomic positions of a molecular entity to an energy. Architectures designed for force fields typically do not incorporate graph information Schütt et al. (2018, 2017); Smith et al. (2017, 2017), since bonds are broken during chemical reactions and may not be clearly defined. This is contrasted with architectures for property prediction, which are typically graph-based Duvenaud et al. (2015); Yang et al. (2019); Dai et al. (2016) but can benefit from 3D information Gilmer et al. (2017). Here we explore both possibilities. In one case we modify the SchNet force field architecture Schütt et al. (2018, 2017) (code adapted from [62]) to predict properties. In a second case we modify the ChemProp model to include distance-based edge features between bonded and non-bonded atoms. This is in addition to the regular graph edge features between bonded atoms. We call this model ChemProp3D.

In the SchNet model, the feature vector of each atom is initialized with an embedding function. This embedding generates a random vector that is unique to every atom with a given atomic number, and is also learnable. The edge features at each step are generated through a so-called filter network. The filter network converts a distance between two atoms, $r_{vw}$, into an edge vector $e_{vw}$. This is accomplished by expanding the distance between atoms $v$ and $w$ in a basis of Gaussian functions. The centers of these Gaussians are evenly distributed up to a cutoff radius, taken here to be 5.0 Å. This converts a distance into a vector. Further linear and non-linear (shifted-softplus) operations are applied. Because only the distance between two atoms is used to create $e_{vw}$, the features produced are invariant to rotations and translations.

In each convolution , the new messages and hidden vectors are given by


Here, denotes element-wise multiplication and denotes the so-called interaction block. The interaction block consists of a set of linear and non-linear operations applied atom-wise to the atomic features. These operations are applied before and after multiplication with . In the original SchNet implementation, the readout layer converted each atomic feature vector into a single number, and the numbers were summed to give an energy. Consistent with the ChemProp model and the notion of property prediction, we here instead convert the node features into a molecular fingerprint, and then apply the readout function to the fingerprint. Details of our implementation of the SchNet model can be found in the SM.

We follow the spirit of both SchNet and ChemProp to produce the ChemProp3D model. Rather than only considering neighbors bonded to an atom, we consider all neighbors within a 5 Å cutoff. For bonded neighbors, edge features are a concatenation of bond features and distance features. For non-bonded neighbors, edge features are a zero-array concatenated with distance features. Distances are expanded in a set of 50 Gaussian functions, distributed evenly every 0.1 Å up to a maximum of 5 Å. They are then followed by a fully-connected layer and activation function. That is, in the $t^{\mathrm{th}}$ convolution, distances are converted to vectors through

$$d_{vw}^{t} = \sigma(W^{t} g(r_{vw}) + b^{t}),$$

where $r_{vw}$ is the distance, $g$ is the Gaussian function, $W^{t}$ is a learned SchNet matrix, $b^{t}$ is a bias, and $\sigma$ is the SchNet activation function. Consistent with the original SchNet paper we use the shifted softplus for the distance activation, but use the ReLU in all other places. We also use the ring size as an atomic feature for atoms in rings.
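The distance featurization described above (Gaussian expansion, then a fully-connected layer and a shifted softplus) might look roughly as follows. This is a sketch under stated assumptions: the Gaussian centers are placed at 0.1, 0.2, ..., 5.0 Å, the Gaussian width of 0.1 Å is a guess, and the weight matrix is a fixed placeholder rather than a learned parameter.

```python
import math

def gaussian_expand(distance, n_gaussians=50, spacing=0.1, width=0.1):
    """Expand a distance (in Angstrom) on evenly spaced Gaussian basis functions."""
    # Assumed centers: one every `spacing` Angstrom, ending at the 5 Angstrom cutoff.
    centers = [spacing * (k + 1) for k in range(n_gaussians)]
    return [math.exp(-((distance - c) ** 2) / (2.0 * width ** 2)) for c in centers]

def shifted_softplus(x):
    """ln(0.5 * exp(x) + 0.5), written in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x))) - math.log(2.0)

def distance_features(distance, weight, bias):
    """Apply a dense layer and shifted softplus to the Gaussian expansion."""
    g = gaussian_expand(distance)
    return [
        shifted_softplus(sum(w * x for w, x in zip(row, g)) + b)
        for row, b in zip(weight, bias)
    ]

# Placeholder parameters (hypothetical): 4 output features from 50 Gaussians.
weight = [[0.01] * 50 for _ in range(4)]
bias = [0.0] * 4
expansion = gaussian_expand(1.5)
features = distance_features(1.5, weight, bias)
```

Because the only geometric input is an interatomic distance, the resulting features do not change when the molecule is rotated or translated.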

The above discussion applies to molecules associated with one geometry. In the GEOM dataset, however, multiple conformers can be used for a single stereochemical formula. There are two immediate possibilities for pooling these conformers. The first, which we call WeightPool, is to create molecular fingerprints for each conformer, multiply each by its statistical weight, and add them. The second, which we call NnPool, is to use the fingerprint and the statistical weight as inputs to a neural network that generates a final fingerprint. The different pooling options are then:

$$\text{WeightPool:} \quad h = \sum_i p_i h_i \qquad (10)$$

$$\text{NnPool:} \quad h = \phi\Big(\sum_i W \, \mathrm{cat}(h_i, p_i) + b\Big) \qquad (11)$$

The first case multiplies the fingerprint $h_i$ of conformer $i$ by its weight $p_i$ and sums the result. The second case multiplies a learned matrix $W$ with the concatenation of $h_i$ and $p_i$ before summing the result, adding a bias $b$, and applying a non-linear operation $\phi$. NnPool is of interest for applications in which the target property is dominated by conformers of low statistical weight. This can often be the case in therapeutics, in which a single low-probability conformer can result in high-affinity binding.
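The two pooling options can be sketched as follows. The WeightPool function follows the description directly. The NnPool function follows the order of operations in the text (matrix product with the concatenated fingerprint and weight, sum over conformers, bias, non-linearity), but the choice of ReLU as the non-linearity and the matrix shape are assumptions made for illustration.

```python
def weight_pool(fingerprints, weights):
    """Sum of conformer fingerprints, each scaled by its statistical weight."""
    n_features = len(fingerprints[0])
    return [
        sum(p * fp[k] for fp, p in zip(fingerprints, weights))
        for k in range(n_features)
    ]

def nn_pool(fingerprints, weights, matrix, bias):
    """Project cat(fingerprint, weight) per conformer, sum, add bias, apply ReLU."""
    summed = [0.0] * len(matrix)
    for fp, p in zip(fingerprints, weights):
        x = list(fp) + [p]  # concatenation of the fingerprint and its weight
        for r, row in enumerate(matrix):
            summed[r] += sum(a * b for a, b in zip(row, x))
    # ReLU is an assumed choice of non-linearity for this sketch.
    return [max(0.0, s + b) for s, b in zip(summed, bias)]

# Two hypothetical conformers with two-dimensional fingerprints.
fingerprints = [[1.0, 2.0], [3.0, 4.0]]
weights = [0.25, 0.75]
pooled = weight_pool(fingerprints, weights)

# Placeholder parameters: a 2 x 3 matrix that ignores the weight entry.
matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
nn_pooled = nn_pool(fingerprints, weights, matrix, bias=[0.0, 0.0])
```

With learned parameters, NnPool can in principle amplify the contribution of a low-weight conformer, which WeightPool cannot do because it scales every fingerprint by its probability.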

3.2 Model performance

| Model | Learnable | Graph | 3D (lowest state) | 3D (10 lowest states) |
| --- | --- | --- | --- | --- |
| ChemProp | Y | Y | N | N |
| SchNet | Y | N | Y | N |
| ChemProp3D | Y | Y | Y | N |
| Weighted E3FP | N | Y | Y | Y |
| Morgan | N | Y | N | N |

Table 3: Description of the information contained in the molecular fingerprints of various models. Note that an E3FP fingerprint is for a single conformer only, but in this work we use a weighted average of E3FP fingerprints over the 10 lowest-energy conformers.

| Model | $S$ | $\langle E \rangle$ | (unique conformers) |
| --- | --- | --- | --- |
| ChemProp | 0.639 | 0.096 | 0.131 |
| SchNet | 0.632 | 0.100 | 0.139 |
| ChemProp3D | 0.572 | 0.096 | 0.133 |
| ChemProp + Morgan | 0.679 | 0.105 | 0.150 |
| ChemProp + weighted E3FP | 0.605 | 0.099 | 0.129 |
| Morgan | 0.824 | 0.115 | 0.173 |
| Weighted E3FP | 0.966 | 0.115 | 0.237 |

Table 4: Prediction MAE of the different models for three properties related to conformational degrees of freedom of the drug dataset.

| Model | $S$ | $\langle E \rangle$ |
| --- | --- | --- |
| ChemProp | 0.833 | 0.116 |
| SchNet25-NnPool | 0.949 | 0.118 |
| SchNet25-WeightPool | 0.817 | 0.125 |

Table 5: Comparison of the performance of ChemProp and conformer-pooled SchNet, using 25 conformers per molecule.

We trained different models to predict three quantities related to conformational information. The first quantity is the ensemble entropy, $S = -R \sum_i p_i \ln p_i$ Grimme (2019), where the sum is over the statistical probabilities $p_i$ of the $i^{\mathrm{th}}$ conformer, and $R$ is the gas constant. The conformational Gibbs free energy is related to $S$ through $G = -TS$ Grimme (2019). The conformational entropy is a measure of the conformational degrees of freedom available to a molecule. A molecule with only one conformer has an entropy of exactly 0, while a molecule with equal statistical weight for an infinite number of conformers has infinite conformational entropy. The conformational Gibbs free energy is an important quantity for predicting the binding affinity of a drug to a target. The affinity is determined by the change in Gibbs free energy of the molecule and protein upon binding, which includes the loss of molecular conformational free energy Frederick et al. (2007). The second quantity is the average conformational energy. The average energy is given by $\langle E \rangle = \sum_i p_i E_i$, where $E_i$ is the energy of the $i^{\mathrm{th}}$ conformer. Each energy is defined with respect to the lowest-energy conformer. The third quantity is the number of unique conformers for a given molecule, as predicted by CREST within a maximum energy window Grimme (2019).
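The ensemble quantities described above follow directly from the conformer probabilities. The sketch below computes the entropy, the conformational free energy, and the average energy for a hypothetical three-conformer ensemble. The units (cal/mol K for entropy, kcal/mol for energies) follow Table 2, and the temperature of 298.15 K is an assumption standing in for "room temperature".

```python
import math

R_CAL = 1.987204  # gas constant in cal/(mol K)

def ensemble_entropy(probs):
    """Conformational entropy: -R * sum_i p_i ln p_i, in cal/(mol K)."""
    return -R_CAL * sum(p * math.log(p) for p in probs if p > 0.0)

def conformational_free_energy(probs, temperature=298.15):
    """Conformational Gibbs free energy, -T * S, converted to kcal/mol."""
    return -temperature * ensemble_entropy(probs) / 1000.0

def average_energy(probs, energies):
    """Probability-weighted mean of the conformer energies."""
    return sum(p * e for p, e in zip(probs, energies))

# Hypothetical ensemble: probabilities and relative energies in kcal/mol.
probs = [0.5, 0.3, 0.2]
energies = [0.0, 0.3, 0.6]
s = ensemble_entropy(probs)
g = conformational_free_energy(probs)
e_avg = average_energy(probs, energies)
```

For $N$ equally probable conformers the entropy reduces to $R \ln N$, which is zero for a single conformer and grows without bound as $N$ increases, in line with the limits stated in the text.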

The models include varying degrees of 3D information and various levels of learnable molecular embeddings. A summary of the information contained in each approach is given in Table 3. In quantum chemistry one often attempts to optimize a geometry so that its energy is at a global minimum. We asked how much this ground state geometry could improve training by incorporating its 3D information in the SchNet and ChemProp3D models. We also considered the impact of graph information, which is contained in all models except for SchNet, as well as non-learnable features, through the inclusion of Morgan Rogers and Hahn (2010) and E3FP Axen et al. (2017) fingerprints. Morgan fingerprints contain only graph information, while E3FP fingerprints also contain 3D information. It is informative to know if limited knowledge of the conformers of a molecule, e.g. through a short MD run, could improve training further. To this end we also incorporated a statistically weighted average of E3FP fingerprints using only the 10 lowest-energy conformers of each molecule. In this case a fingerprint was produced by multiplying the E3FP fingerprint of each conformer by its statistical weight (properly re-normalized to account for missing conformers) and adding the results.
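The truncated, re-normalized fingerprint average described above can be sketched as follows. The fingerprints here are short placeholder lists rather than real E3FP bit vectors, and the function name is hypothetical. The sketch keeps the lowest-energy conformers, rescales their weights so they sum to one, and averages.

```python
def truncated_weighted_fingerprint(fingerprints, weights, energies, top_k=10):
    """Weighted average over the top_k lowest-energy conformers, re-normalized."""
    # Indices of the top_k conformers with the lowest energies.
    order = sorted(range(len(energies)), key=lambda i: energies[i])[:top_k]
    # Re-normalize so the retained weights sum to one.
    norm = sum(weights[i] for i in order)
    n_bits = len(fingerprints[0])
    return [
        sum(weights[i] / norm * fingerprints[i][k] for i in order)
        for k in range(n_bits)
    ]

# Hypothetical three-conformer example, keeping only the two lowest in energy.
fingerprints = [[1, 0], [0, 1], [1, 1]]
weights = [0.6, 0.1, 0.3]
energies = [0.0, 1.0, 0.5]
fp = truncated_weighted_fingerprint(fingerprints, weights, energies, top_k=2)
```

In this example the second conformer is dropped, so the remaining weights 0.6 and 0.3 are rescaled to 2/3 and 1/3 before averaging.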

A description of the architecture and training hyperparameters used for each model can be found in the SM. We used published architecture hyperparameter values and we optimized dropout rates with SigOpt Clark and Hayes (2019). A total of 250,000 molecules were used, with 80% for training, 10% for validation, and 10% for testing. The same training, validation, and test splits were used for all models. 2D models were trained for 30 epochs and 3D models for 100 epochs due to slower convergence. We checked that training 2D models past 30 epochs did not improve performance. The mean absolute error (MAE) was used as a performance metric. In all cases the models with the best validation scores were selected for evaluation on the test set.
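A fixed 80/10/10 split shared by all models, together with the MAE metric, can be sketched as below. The random shuffle with a fixed seed is an assumption about how a reproducible split might be made. The text does not state how the molecules were actually assigned to the three sets.

```python
import random

def split_indices(n_items, seed=0, train_frac=0.8, val_frac=0.1):
    """Shuffle indices once with a fixed seed and cut into train/val/test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(train_frac * n_items)
    n_val = round(val_frac * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over all examples."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

train_idx, val_idx, test_idx = split_indices(250_000)
example_mae = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 3.0, 5.0])
```

Reusing the same seed for every model keeps the three index sets identical across runs, so differences in test MAE reflect the models rather than the split.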

The model performance is shown in Table 4, and can be contextualized by analyzing the dataset statistics in Table 2. Note that chemical accuracy for energy prediction is typically considered to be 1 kcal/mol, and sub-chemical accuracy to be 0.24 kcal/mol. It is clear that learnable fingerprint embeddings significantly improve performance, since Morgan and E3FP embeddings alone result in poor performance. The SchNet model, trained on the 3D geometry of the lowest energy conformer, performs comparably to ChemProp in all categories. The combination of ChemProp with 3D information also leads to comparable performance, though it outperforms the ChemProp entropy prediction by 11%. With this learning architecture, one-conformer 3D information can moderately improve performance in some contexts, but the advantage is far from decisive.

Finally, we asked whether 3D models trained on pooled conformers could outperform 2D models. Specifically we asked whether SchNet, a model built to predict energies, could implicitly learn the entropy and average energy associated with an ensemble of geometries. To this end we re-trained SchNet on a sample of 25,000 species, using up to 25 conformers per molecule, for a total of 625,000 geometries. We used both WeightPool and NnPool for SchNet, and compared results to ChemProp trained on the stereochemical formula of the same species. To differentiate this model from the earlier instance of SchNet, which used only a single geometry, we call the models SchNet25-WeightPool and SchNet25-NnPool. Details of the training can be found in the SM. The results are given in Table 5. Interestingly, we see that conformer pooling leads only to minor improvement over 2D models. In particular, the SchNet25-WeightPool model is only 2% better than ChemProp at predicting the entropy, while the NnPool model is significantly worse. The ordering is reversed for the average energy. To see why this is surprising, consider the average energy task as an example. If the model sees 25 conformers per molecule, as well as the average energy, one would expect it to learn which conformations are high-energy. Therefore, when shown a set of conformers for a new species, one would expect it to identify the high-energy structures. ChemProp, by contrast, has no access to this information and must learn from the graph alone. It is therefore intriguing that 3D information from the conformer set would offer no clear advantage. Given the similar performance of the WeightPool model to ChemProp, it appears that the state-of-the-art approaches to fingerprinting 3D structures can be improved for ensemble prediction tasks. This offers an attractive challenge to the machine learning and chemistry communities.

4 Discussion

3D coordinates are important for predicting single-point properties such as energies and forces for one conformation of one molecular entity. However, here we have found mixed results regarding their ability to enhance the accuracy of ensemble-averaged quantity prediction. 2D-based approaches to ensemble property prediction are not significantly improved by 3D information. These results indicate that either 3D information is not useful for these prediction tasks, or that the current models we have used do not leverage 3D information in an optimal way. With access to our dataset, the community can develop improved models for leveraging 3D information for property prediction. Using our data for training, they will also be able to develop generative models that may obviate the need for expensive conformer simulations.

5 Acknowledgements

The authors thank the XSEDE COVID-19 HPC Consortium, project CHE200039, for compute time. NASA Advanced Supercomputing (NAS) Division and LBNL National Energy Research Scientific Computing Center (NERSC), MIT Engaging cluster, Harvard Cannon cluster, and MIT Lincoln Lab Supercloud clusters are gratefully acknowledged for computational resources and support. The authors also thank Christopher E. Henze (NASA) and Shane Canon and Laurie Stephey (NERSC) for technical discussions and computational support, MIT AI Cures for molecular datasets, and Wujie Wang, Daniel Schwalbe-Koda, and Shi Jun Ang (MIT DMSE) for scientific discussions and access to computer code. Financial support from DARPA (Award HR00111920025) and MIT-IBM Watson AI Lab is acknowledged.


  • [1] Note: 2020-03-28 Cited by: §2.1.
  • [2] Cited by: §1.2.
  • [3] Note: 2020-03-28 Cited by: §2.1.
  • [4] Note: Cited by: §2.1.
  • [5] (2020) Note: 2020-05-22 Cited by: §2.1.
  • M. AlQuraishi (2019) End-to-End Differentiable Learning of Protein Structure. Cell Systems. External Links: Document, ISSN 2405-4712, Link Cited by: §1.
  • B. Anderson, T. Hy, and R. Kondor (2019) Cormorant: Covariant Molecular Neural Networks. External Links: 1906.04015, Link Cited by: §1.
  • S. D. Axen, X. Huang, E. L. Cáceres, L. Gendelev, B. L. Roth, and M. J. Keiser (2017) A simple representation of three-dimensional molecular structure. Journal of medicinal chemistry 60 (17), pp. 7393–7409. Cited by: §3.2.
  • N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher (2019) GuacaMol: Benchmarking Models for de Novo Molecular Design. J. Chem. Inf. Model. 59 (3), pp. 1096–1108. Cited by: §1.2.
  • L. Chan, G. R. Hutchison, and G. M. Morris (2019) Bayesian optimization for conformer generation. Journal of Cheminformatics 11 (1), pp. 32. External Links: Document, ISSN 17582946, Link Cited by: §1.1.
  • [11] Chemprop Machine Learning for Molecular Property Prediction. Note: 2020-03-31 Cited by: §S2.1, §3.1.2, §S3.
  • S. Clark and P. Hayes (2019) SigOpt Web page. Note: Cited by: §S2, §3.2.
  • C. W. Coley, R. Barzilay, T. S. Jaakkola, W. H. Green, and K. F. Jensen (2017) Prediction of Organic Reaction Outcomes Using Machine Learning. ACS Central Science 3 (5), pp. 434–443. External Links: Document, ISSN 2374-7943, Link Cited by: §1.
  • [14] Converts and [sic] xyz file to an RDKit mol object. Note: 2020-04-12 Cited by: §S1.
  • H. Dai, B. Dai, and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, pp. 2702–2711. Cited by: §3.1.2, §3.1.3.
  • H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song (2018) Syntax-Directed Variational Autoencoder for Structured Data. arXiv:1802.08786. External Links: 1802.08786, Link Cited by: §1.
  • N. De Cao and T. Kipf (2018) MolGAN: An implicit generative model for small molecular graphs. arXiv:1805.11973. External Links: Document, 1805.11973, Link Cited by: §1.
  • J. S. Delaney (2004) ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences 44 (3), pp. 1000–1005. External Links: Document, ISSN 00952338 Cited by: §1.2.
  • [19] Drug Repurposing Hub. Note: Cited by: §2.1.
  • D. K. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional Networks on Graphs for Learning Molecular Fingerprints. In Advances in Neural Information Processing Systems, pp. 2215–2223. External Links: 1509.09292v2, Link Cited by: §1, §3.1.3.
  • [21] D. Engel QHTS of yeast-based assay for SARS-CoV PLP: Hit validation. Note: Cited by: §2.1.
  • [22] D. Engel QHTS of yeast-based assay for SARS-CoV PLP. Note: Cited by: §2.1.
  • A. Fraenkel (1993) Complexity of protein folding. Bulletin of Mathematical Biology 55 (6), pp. 1199–1210. External Links: Document, ISSN 00928240, Link Cited by: §1.1.
  • K. K. Frederick, M. S. Marlow, K. G. Valentine, and A. J. Wand (2007) Conformational entropy in molecular recognition by proteins. Nature 448 (7151), pp. 325–329. Cited by: §3.2.
  • N. W. A. Gebauer, M. Gastegger, and K. T. Schütt (2018) Generating equilibrium molecules with deep neural networks. arXiv:1810.11347. External Links: 1810.11347, Link Cited by: §1.1.
  • N. W. A. Gebauer, M. Gastegger, and K. T. Schütt (2019) Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. arXiv:1906.00957. External Links: 1906.00957, Link Cited by: §1.1.
  • [27] GEOM: energy-annotated molecular conformations. Note: Cited by: §S1, §2.1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §3.1.1, §3.1.3, §3.1.
  • R. Gómez-Bombarelli, J. Aguilera-Iparraguirre, T. D. Hirzel, D. Duvenaud, D. Maclaurin, M. A. Blood-Forsythe, H. S. Chae, M. Einzinger, D. G. Ha, T. Wu, G. Markopoulos, S. Jeon, H. Kang, H. Miyazaki, M. Numata, S. Kim, W. Huang, S. I. Hong, M. Baldo, R. P. Adams, and A. Aspuru-Guzik (2016) Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nature Materials 15 (10), pp. 1120–1127 (en). External Links: Document, ISSN 1476-1122 Cited by: §1.
  • R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik (2018) Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 4 (2), pp. 268–276. External Links: Document, 1610.02415, ISSN 23747951, Link Cited by: §1.2, §1.
  • S. K. Gottipati, B. Sattarov, S. Niu, Y. Pathak, H. Wei, S. Liu, K. M. J. Thomas, S. Blackburn, C. W. Coley, J. Tang, S. Chandar, and Y. Bengio (2020) Learning To Navigate The Synthetically Accessible Chemical Space Using Reinforcement Learning. pp. 1–23. External Links: 2004.12485, Link Cited by: §1.
  • S. Grimme (2019) Exploration of chemical compound, conformer, and reaction space with meta-dynamics simulations based on tight-binding quantum chemical calculations. Journal of chemical theory and computation 15 (5), pp. 2847–2862. Cited by: §2.1, §2.2, §3.2.
  • G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. Farias, and A. Aspuru-Guzik (2017) Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. arXiv:1705.10843. External Links: 1705.10843, Link Cited by: §1.
  • S. R. Heller, A. McNaught, I. Pletnev, S. Stein, and D. Tchekhovskoi (2015) InChI, the IUPAC International Chemical Identifier.. Journal of cheminformatics 7, pp. 23. External Links: Document, ISSN 1758-2946 Cited by: §1.1.
  • J. Hoffmann, L. Maestrati, Y. Sawada, J. Tang, J. M. Sellier, and Y. Bengio (2019) Data-Driven Approach to Encoding and Decoding 3-D Crystal Structures. External Links: 1909.00949, Link Cited by: §1.1.
  • M. Hoffmann and F. Noé (2019) Generating valid Euclidean distance matrices. External Links: 1910.03131, Link Cited by: §1.1.
  • [37] Home | AI Cures. External Links: Link Cited by: §2.1.
  • F. Imrie, A. R. Bradley, M. van der Schaar, and C. M. Deane (2020) Deep Generative Models for 3D Linker Design. Journal of chemical information and modeling 60 (4), pp. 1983–1995. External Links: Document, ISSN 1549960X Cited by: §1.1.
  • J. Ingraham, A. Riesselman, C. Sander, and D. Marks (2019) Learning Protein Structure with a Differentiable Simulator. In International Conference on Learning Representations, Cited by: §1.
  • W. Jin, R. Barzilay, and T. Jaakkola (2018) Junction Tree Variational Autoencoder for Molecular Graph Generation. In Proceedings of the 35th International Conference on Machine Learning, External Links: Document, 1802.04364, ISBN 9781510855144, ISSN 1938-7228, Link Cited by: §1.
  • I. Y. Kanal, J. A. Keith, and G. R. Hutchison (2018) A sobering assessment of small-molecule force field methods for low energy conformer predictions. International Journal of Quantum Chemistry 118 (5), pp. e25512. External Links: Document, ISSN 00207608, Link Cited by: §2.2.
  • S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley (2016) Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design 30 (8), pp. 595–608. External Links: Document, 1603.00856, ISSN 0920-654X, Link Cited by: §1.
  • Y. Kim and W. Y. Kim (2015) Universal structure conversion method for organic molecules: from atomic connectivity to three-dimensional geometry. Bulletin of the Korean Chemical Society 36 (7), pp. 1769–1777. Cited by: §S1.
  • J. Klicpera, J. Groß, and S. Günnemann (2020) Directional Message Passing for Molecular Graphs. External Links: 2003.03123, Link Cited by: §1.
  • Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia (2018) Learning Deep Generative Models of Graphs. arXiv:1803.03324. External Links: Document, 1803.03324, ISSN 2326-8298, Link Cited by: §1.
  • P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert (2004) Extensions of marginalized graph kernels. In Proceedings of the twenty-first international conference on Machine learning, pp. 70. Cited by: §3.1.2.
  • E. Mansimov, O. Mahmood, S. Kang, and K. Cho (2019) Molecular Geometry Prediction using a Deep Generative Graph Neural Network. Scientific Reports 9 (1), pp. 1–13. External Links: Document, 1904.00314, ISSN 20452322 Cited by: §1.1.
  • Ł. Maziarka, T. Danel, S. Mucha, K. Rataj, J. Tabor, and S. Jastrzȩbski (2020) Molecule Attention Transformer. External Links: 2002.08264, Link Cited by: §1.1.
  • A.D. McNaught and A. Wilkinson (Eds.) (2009) IUPAC Compendium of Chemical Terminology (the "Gold Book"). 2nd edition, Blackwell Scientific Publications, Oxford. External Links: Document, ISBN 0-9678550-9-8 Cited by: §1.1, §1.1.
  • D. Mendez, A. Gaulton, A. P. Bento, J. Chambers, M. De Veij, E. Félix, M. P. Magariños, J. F. Mosquera, P. Mutowo, M. Nowotka, M. Gordillo-Marañón, F. Hunter, L. Junco, G. Mugumbate, M. Rodriguez-Lopez, F. Atkinson, N. Bosc, C. J. Radoux, A. Segura-Cabrera, A. Hersey, and A. Leach (2018) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Research 47 (D1), pp. D930–D940. External Links: ISSN 0305-1048, Document, Link, Cited by: §1.2.
  • D. L. Mobley and J. P. Guthrie (2014) FreeSolv: A database of experimental and calculated hydration free energies, with input files. Journal of Computer-Aided Molecular Design 28 (7), pp. 711–720. External Links: Document, ISSN 15734951 Cited by: §1.2.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §3.1.2.
  • F. Noé, S. Olsson, J. Köhler, and H. Wu (2019) Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 365 (6457), pp. eaaw1147. External Links: Document, 1812.01729, ISSN 10959203 Cited by: §1.
  • N. M. O’Boyle, T. Vandermeersch, C. J. Flynn, A. R. Maguire, and G. R. Hutchison (2011) Confab - Systematic generation of diverse low-energy conformers. Journal of Cheminformatics 3 (1), pp. 8. External Links: Document, ISSN 17582946, Link Cited by: §2.2.
  • M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen (2017) Molecular De Novo Design through Deep Reinforcement Learning. arXiv:1704.07555. External Links: Document, 1704.07555, ISSN 1758-2946, Link Cited by: §1.
  • D. Polykovskiy, A. Zhebrak, B. Sanchez-Lengeling, S. Golovanov, O. Tatanov, S. Belyaev, R. Kurbanov, A. Artamonov, V. Aladinskiy, M. Veselov, A. Kadurin, S. Nikolenko, A. Aspuru-Guzik, and A. Zhavoronkov (2018) Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. arXiv:1811.12823. Cited by: §1.2.
  • M. Popova, O. Isayev, and A. Tropsha (2018) Deep reinforcement learning for de novo drug design. Science Advances 4 (7), pp. eaap7885. External Links: Document, ISSN 2375-2548, Link Cited by: §1.
  • R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld (2014) Quantum chemistry structures and properties of 134 kilo molecules.. Scientific data 1, pp. 140022. External Links: Document, ISSN 2052-4463, Link Cited by: §1.2.
  • B. Ramsundar, P. Eastman, P. Walters, V. Pande, K. Leswing, and Z. Wu (2019) Deep learning for the life sciences. O’Reilly Media. Note: Cited by: §1.2, §1.
  • [60] RDKit: Open-source cheminformatics. Note: Cited by: §S1, §2.2.
  • D. Rogers and M. Hahn (2010) Extended-connectivity fingerprints. Journal of chemical information and modeling 50 (5), pp. 742–754. Cited by: §3.2.
  • [62] SchNetPack - Deep Neural networks for Atomistic systems. Note: 2018-10-23 Cited by: §3.1.3.
  • K. Schütt, P. Kindermans, H. E. S. Felix, S. Chmiela, A. Tkatchenko, and K. Müller (2017) SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in neural information processing systems, pp. 991–1001. Cited by: §1.1, §3.1.3.
  • K. T. Schütt, H. E. Sauceda, P. Kindermans, A. Tkatchenko, and K. Müller (2018) SchNet–A deep learning architecture for molecules and materials. The Journal of Chemical Physics 148 (24), pp. 241722. Cited by: §S2.2, §3.1.3.
  • C. H. Schwab (2010) Conformations and 3D pharmacophore searching. Vol. 7, Elsevier Ltd. External Links: Document, ISSN 17406749 Cited by: §2.2.
  • D. Schwalbe-Koda and R. Gómez-Bombarelli (2019) Generative Models for Automatic Chemical Design. arXiv:1907.01632. External Links: 1907.01632, Link Cited by: §1.
  • M. H. S. Segler, M. Preuss, and M. P. Waller (2018) Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555 (7698), pp. 604–610. External Links: Document, ISSN 0028-0836, Link Cited by: §1.
  • G. N. C. Simm, R. Pinsler, and J. M. Hernández-Lobato (2020) Reinforcement Learning for Molecular Design Guided by Quantum Mechanics. External Links: 2002.07717, Link Cited by: §1.1.
  • J. S. Smith, O. Isayev, and A. E. Roitberg (2017) ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science 8 (4), pp. 3192–3203. External Links: Document, 1610.08935, ISSN 2041-6520, Link Cited by: §1.2, §3.1.3.
  • J. S. Smith, O. Isayev, and A. E. Roitberg (2017) ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules. Scientific Data 4, pp. 170193. External Links: Document, ISSN 2052-4463, Link Cited by: §1.2, §3.1.3.
  • J. S. Smith, B. Nebgen, N. Lubbers, O. Isayev, and A. E. Roitberg (2018) Less is more: Sampling chemical space with active learning. Journal of Chemical Physics 148 (24), pp. 241733. External Links: Document, 1801.09319, ISSN 00219606, Link Cited by: §1.2.
  • D. C. Spellmeyer, A. K. Wong, M. J. Bower, and J. M. Blaney (1997) Conformational analysis using distance geometry methods. Journal of Molecular Graphics and Modelling 15 (1), pp. 18–36. External Links: Document, ISSN 10933263 Cited by: §2.2.
  • T. Sterling and J. J. Irwin (2015) ZINC 15–Ligand Discovery for Everyone.. Journal of chemical information and modeling 55 (11), pp. 2324–37. External Links: Document, ISSN 1549-960X Cited by: §1.2.
  • M. Stieffenhofer, M. Wand, and T. Bereau (2020) Adversarial Reverse Mapping of Equilibrated Condensed-Phase Molecular Structures. External Links: 2003.07753, Link Cited by: §1.1.
  • J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French, L. A. Carfrae, Z. Bloom-Ackerman, et al. (2020) A deep learning approach to antibiotic discovery. Cell 180 (4), pp. 688–702. Cited by: §1, §2.1.
  • N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley (2018) Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. External Links: 1802.08219, Link Cited by: §1.
  • [77] V. Tokars and A. Mesecar QFRET-based primary biochemical high throughput screening assay to identify inhibitors of the SARS coronavirus 3C-like Protease (3CLPro). Note: Cited by: §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §S3.
  • R. Wang, X. Fang, Y. Lu, and S. Wang (2004) The PDBbind database: Collection of binding affinities for protein-ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry 47 (12), pp. 2977–2980. External Links: Document, ISBN 0022-2623, ISSN 00222623, Link Cited by: §1.2.
  • W. Wang, S. Axelrod, and R. Gómez-Bombarelli (2020) Differentiable Molecular Simulations for Control and Learning. (February). External Links: 2003.00868, Link Cited by: §1.
  • W. Wang and R. Gómez-Bombarelli (2019) Coarse-graining auto-encoders for molecular dynamics. npj Computational Materials 5 (1), pp. 125. External Links: Document, ISBN 4152401902615, ISSN 2057-3960, Link Cited by: §1.1.
  • D. Weininger (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Modeling 28 (1), pp. 31–36. External Links: Document, ISSN 1549-9596, Link Cited by: §1.1.
  • Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande (2018) MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9 (2), pp. 513–530. Cited by: §1.2.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21. External Links: Document, 1901.00596, Link Cited by: §1.1.
  • K. Yang, K. Swanson, W. Jin, C. Coley, P. Eiden, H. Gao, A. Guzman-Perez, T. Hopper, B. Kelley, M. Mathea, et al. (2019) Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling 59 (8), pp. 3370–3388. Cited by: §3.1.1, §3.1.1, §3.1.2, §3.1.3.
  • K. Yang, K. Swanson, W. Jin, C. Coley, P. Eiden, H. Gao, A. Guzman-Perez, T. Hopper, B. Kelley, M. Mathea, A. Palmer, V. Settels, T. Jaakkola, K. Jensen, and R. Barzilay (2019) Analyzing Learned Molecular Representations for Property Prediction. Journal of Chemical Information and Modeling 59 (8), pp. 3370–3388. External Links: Document, 1904.01561, ISSN 15205142, Link Cited by: §1.
  • M. Zampieri, M. Zimmermann, M. Claassen, and U. Sauer (2017) Nontargeted metabolomics reveals the multilevel response to antibiotic perturbations. Cell reports 19 (6), pp. 1214–1228. Cited by: §2.1.
  • A. Zhavoronkov, Y. A. Ivanenkov, A. Aliper, M. S. Veselov, V. A. Aladinskiy, A. V. Aladinskaya, V. A. Terentiev, D. A. Polykovskiy, M. D. Kuznetsov, A. Asadulaev, Y. Volkov, A. Zholus, R. R. Shayakhmetov, A. Zhebrak, L. I. Minaeva, B. A. Zagribelnyy, L. H. Lee, R. Soll, D. Madge, L. Xing, T. Guo, and A. Aspuru-Guzik (2019) Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature biotechnology 37 (9), pp. 1038–1040. External Links: Document, ISBN 4158701902, ISSN 1546-1696, Link Cited by: §1.

Supplementary Material

S1 Data

There are four datasets, organized by molecule type (drugs or QM9), and whether they contain the original CREST information (drugs_crude.msgpack.tar.gz and qm9_crude.msgpack.tar.gz) or post-processed feature information (drugs_featurized.msgpack.tar.gz and qm9_featurized.msgpack.tar.gz). The notebook tutorial at [27] explains how each dataset is organized and how to extract the data.

To explain the necessity of the featurized files, consider that CREST simulations allow for reactivity, so not all geometries arising from a calculation correspond to the same molecular graph that they started with. Indeed, we have found a number of simulations in which bonds are broken and re-formed, as in tautomerization. For this reason it was necessary to examine each geometry individually after simulation to determine its molecular graph. To this end we used a locally modified version of xyz2mol Kim and Kim [2015] (code accessed from [14]) to generate an RDKit mol object [60]. The mol object’s graph information is contained in the featurized files through dictionaries of atom and bond features. The SMILES string generated by xyz2mol, as well as the canonical form of this string, are also given.

Additionally, for 3D-based machine learning models, each convolution aggregates atomic information from atoms within a given cutoff radius. The most efficient method of storing this information is to generate a so-called neighbor list for each atom before training. The neighbor list consists of a set of pairs of indices, each of which corresponds to two atoms that are within the cutoff radius of each other. We included a neighbor list, computed for a fixed cutoff in Å, in the featurized files.
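As an illustration of the data structure, the following is a minimal sketch of a neighbor list built by brute force from Cartesian coordinates. The function name and the three-atom geometry are hypothetical, and the 5 Å cutoff is borrowed from Table S3 purely for the example; the cutoff actually used for the stored lists may differ.

```python
import math


def neighbor_list(positions, cutoff):
    """Return directed index pairs (i, j), i != j, with |r_i - r_j| < cutoff.

    positions: sequence of (x, y, z) tuples in Å; cutoff: radius in Å.
    Brute-force O(N^2) search, adequate for small molecules.
    """
    pairs = []
    for i, ri in enumerate(positions):
        for j, rj in enumerate(positions):
            if i != j and math.dist(ri, rj) < cutoff:
                pairs.append((i, j))
    return pairs


# Hypothetical three-atom geometry (Å); the third atom lies beyond the cutoff.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
pairs = neighbor_list(positions, cutoff=5.0)
```

Storing both (i, j) and (j, i) is one common convention for message passing; an undirected list would keep only pairs with i < j.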

Two notes are in order:

  1. Both the SMILES from xyz2mol and its corresponding canonical form may be different from the original SMILES. This may be because the graph contains more information than the original SMILES (e.g., because the original did not specify stereochemistry, meaning that one random stereoisomer was chosen to seed the CREST simulations), because the SMILES strings are resonance structures, or because the connectivity and bond types are different. The latter is the case when a chemical reaction occurs, such as in the case of tautomerism.

  2. Not all conformers could be successfully converted to graphs, and so the featurized files contain fewer SMILES strings than the crude files.

S2 Hyperparameter optimization

Our approach was to optimize dropout rates and use published values for other hyperparameters when possible. This was based on our experience with smaller datasets, which showed that dropout rates were the most important factor for 3D models and models with non-learnable fingerprints. In all cases we optimized the natural logarithm of the dropout rate with SigOpt Clark and Hayes [2019], using a data subset of 60,000, an optimization budget of 20, and a train/validation/test split of 80/10/10. The allowed range of the log dropout rate was [-5, 0] in all cases. The dropout rate with the lowest test error was selected. The dropout rates were optimized separately for each architecture and for each prediction quantity.
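The effect of searching in log space can be sketched as follows. A range of [-5, 0] for the natural logarithm corresponds to dropout rates between exp(-5) ≈ 0.0067 and 1. The example below uses plain random search as a stand-in for the SigOpt optimizer and does not call the SigOpt API; the function names and the toy error function are hypothetical.

```python
import math
import random

LOG_MIN, LOG_MAX = -5.0, 0.0  # allowed range of ln(dropout rate) from the text


def sample_dropout(rng):
    """Draw a dropout rate whose natural logarithm is uniform in the range."""
    return math.exp(rng.uniform(LOG_MIN, LOG_MAX))


def search_dropout(evaluate, budget=20, seed=0):
    """Random search (a stand-in for SigOpt) over log-dropout.

    evaluate: callable mapping a dropout rate to a held-out error.
    Returns the best (rate, error) found within the budget.
    """
    rng = random.Random(seed)
    best_rate, best_err = None, math.inf
    for _ in range(budget):
        rate = sample_dropout(rng)
        err = evaluate(rate)
        if err < best_err:
            best_rate, best_err = rate, err
    return best_rate, best_err


def toy_error(rate):
    """Hypothetical error surface standing in for training and evaluation."""
    return (math.log(rate) + 4.0) ** 2


best_rate, best_err = search_dropout(toy_error)
```

Sampling uniformly in the logarithm spends as many trials between 0.007 and 0.07 as between 0.1 and 1, which suits a parameter whose useful values span orders of magnitude.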

| Hyperparameter | Value |
| --- | --- |
| Hidden state dimension | 300 |
| Readout layers | 2 |
| Convolutions | 3 |
| Activation | ReLU |

Table S1: Fixed ChemProp hyperparameters.

| Model | Dropout | Dropout | Dropout |
| --- | --- | --- | --- |
| ChemProp | 0.008 | 0.015 | 0.018 |
| ChemProp + Morgan | 0.097 | 0.309 | 0.224 |
| ChemProp + Weighted E3FP | 0.057 | 0.115 | 0.165 |
| Morgan | 0.062 | 0.142 | 0.078 |
| Weighted E3FP | 0.021 | 0.033 | 0.232 |

Table S2: Optimized dropout rates for ChemProp and its variants.

| Hyperparameter | Value |
| --- | --- |
| Atomic fingerprint length | 64 |
| Molecular fingerprint length | 300 |
| Gaussian spacing | 0.1 Å |
| Cutoff radius | 5 Å |
| Convolution activation | Shifted softplus |
| Convolutions | 2 |
| Readout layers | 2 |
| Readout activation | ReLU |

Table S3: Fixed SchNet hyperparameters.

| Model | Dropout | Dropout | Dropout |
| --- | --- | --- | --- |
| Convolution dropout | 0.020 | 0.015 | 0.008 |
| Readout dropout | 0.007 | 0.031 | 0.009 |

Table S4: Optimized SchNet dropout rates.

| Model | Dropout | Dropout | Dropout |
| --- | --- | --- | --- |
| Convolution dropout | 0.255 | 0.076 | 0.020 |
| Readout dropout | 0.033 | 0.007 | 0.100 |

Table S5: Optimized ChemProp3D dropout rates.

S2.1 ChemProp

We used the default architecture values given in [11] and shown in Table S1. The hidden state dimension in the two readout layers was reduced according to 300 → 300 → 1. A dropout layer was placed after the activation functions following and (see main text), and before the linear layers in the readout phase, as implemented in [11]. The optimized dropout rates are shown in Table S2. Note that hyperparameters were optimized for the prediction of , rather than . These parameters were then used for models predicting and for models predicting . We only reported the prediction of in the main text as the prediction performance was far better.

S2.2 SchNet

The fixed SchNet hyperparameters are given in Table S3. The atomic fingerprint length, convolution activation, Gaussian spacing, and number of readout layers are all those given in Schütt et al. [2018]. The original SchNet paper used three convolutions and a cutoff radius of 10 Å. However, because ChemProp used only three convolutions, and because bond distances are typically under 2 Å, we reduced the cutoff radius to 5 Å and used only two convolutions. Also as in ChemProp, we used the ReLU activation for the readout and a molecular fingerprint length of 300. Since the atomic and molecular fingerprints had different lengths, a single linear layer was used to convert the summed atomic fingerprints to a molecular fingerprint. As in the original SchNet paper, the fingerprint dimension was reduced in the readout layers according to .

Dropout layers were placed before the linear layers in the readout phase and before linear layers in the convolution phase. The dropout rates were optimized separately. Optimized values are shown in Table S4.

S2.3 SchNet25

The same hyperparameters were used for SchNet25 as for SchNet, with the exception of the molecular fingerprint length, which was reduced from 300 to 64 to reduce memory use. Dropout rates optimized for SchNet were also used for SchNet25.

S2.4 ChemProp3D

We used an identical architecture to that of ChemProp, with the addition of 50 Gaussians, a linear layer, and the ReLU activation to map distances to edge features. The linear layer and ReLU activation converted the 50 Gaussians into an edge feature vector of length 64. We also used the SchNet approach in the readout layer. Dropout layers were placed before the linear layers in the readout phase and before linear layers in the convolution phase. The dropout rates were optimized separately. Optimized values are shown in Table S5.
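The distance-to-edge-feature mapping described above can be sketched in a few lines. The example assumes Gaussian centers spaced 0.1 Å apart starting at 0 Å (the spacing listed in Table S3) and a Gaussian width equal to that spacing; both are assumptions, as are the function names. The random weights stand in for learned parameters.

```python
import math
import random


def gaussian_expansion(distance, n_gaussians=50, spacing=0.1):
    """Expand a scalar distance (Å) on evenly spaced Gaussian basis functions.

    Centers sit at 0, spacing, 2*spacing, ...; the width is set equal to
    the spacing (an assumption made for this sketch).
    """
    centers = [k * spacing for k in range(n_gaussians)]
    return [math.exp(-((distance - c) ** 2) / (2.0 * spacing ** 2)) for c in centers]


def linear_relu(x, weights, bias):
    """Apply a dense layer followed by ReLU to the vector x."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


# Random parameters stand in for the learned 50 -> 64 linear layer.
rng = random.Random(0)
n_gaussians, n_edge = 50, 64
weights = [[rng.gauss(0.0, 0.1) for _ in range(n_gaussians)] for _ in range(n_edge)]
bias = [0.0] * n_edge

# Edge feature vector for a hypothetical interatomic distance of 1.5 Å.
edge_features = linear_relu(gaussian_expansion(1.5), weights, bias)
```

The expansion turns a single number into a smooth, localized encoding, so the subsequent linear layer can respond differently to short and long distances.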

S3 Training

In all cases we used the Adam optimizer and mean square error loss for training. For ChemProp we used the default training hyperparameters given in [11]. The learning rate scheduler described in Vaswani et al. [2017] was used with an initial and final learning rate of , a maximum learning rate of , two warmup epochs, 30 total epochs, and a batch size of 50. We verified that performance did not improve when using more than 30 epochs.

For all 3D models we used a batch size of 25 and an initial learning rate of . We used a scheduler that decreased the learning rate by half if validation performance had not improved in 10 epochs. 100 epochs were used in all models except for SchNet25, which required 200 epochs for convergence. Finally, the presence of a small number of outlier geometries initially led to divergences in the SchNet25 training. To account for this, the optimizer did not take a step if the batch loss was divergent. Our reported results in the main text exclude divergent predictions.
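The two training safeguards described above, halving the learning rate on a validation plateau and skipping optimizer steps on divergent batches, can be sketched as follows. The class and function names are hypothetical, the starting learning rate of 1.0 is an arbitrary placeholder rather than the value used in training, and treating "divergent" as "non-finite" is an assumption.

```python
import math


class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr, patience=10, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = math.inf
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


def should_step(batch_loss):
    """Skip the optimizer update when the batch loss is NaN or infinite."""
    return math.isfinite(batch_loss)


# Placeholder learning rate; one improving epoch, then ten stale epochs.
scheduler = HalveOnPlateau(lr=1.0)
history = [scheduler.step(loss) for loss in [0.9] + [1.0] * 10]
```

In a training loop, `should_step` would guard the optimizer update for each batch, while `scheduler.step` would be called once per epoch with the validation loss.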