The goal of drug discovery is to find novel molecules that bind to target proteins of pharmacological interest. This involves a long, costly pipeline in which a massive chemical space is repeatedly filtered into the subset of compounds most likely to be active. Computational methods promise to streamline this process through rapid scoring and ranking of large libraries of molecules in silico prior to hit confirmation with experimental assays [Seifert2003]. This approach, called virtual screening, has established a significant presence in the modern drug development toolkit.
Despite its utility, virtual screening has inherent limitations. Though faster and cheaper than high-throughput experimental screening, it cannot propose compounds absent from existing libraries or suggest changes to input compounds that improve their desired properties. The need to explicitly enumerate molecules for screening means that even sub-linear time virtual screening algorithms [sunseri2016pharmit] cannot cover the entirety of chemical space.
A new class of machine learning methods called generative models has been introduced that could overcome these limitations. Virtual screening falls within the category of discriminative models, which are designed to predict some desired property, such as binding affinity, conditioned on the input data. In a generative modeling formulation, the goal instead becomes learning to directly sample from the underlying distribution of drug-like molecules in order to yield novel active compounds.
Generative models for computational drug design have thus far learned to map from two-dimensional molecules to a continuous latent space that is then explored to produce new molecules with desired properties. In this work, we describe for the first time a deep learning model that represents and generates diverse three-dimensional molecular structures. We encode 3D molecules to a latent space using atomic density grids as input and decode them in the same format. Then we apply a novel optimization algorithm to fit 3D structures to generated atomic densities and reliably produce valid molecules.
2 Related Work
In the last decade, major advances in computer vision [krizhevsky2012imagenet, lecun2015deep] have sparked a surge of interest in deep learning for drug discovery [angermueller2016dl]. Researchers have applied deep learning to binding discrimination for virtual screening [wallach2015atomnet, ragoza2017cnn] and pose ranking for molecular docking [ragoza2017cnn] by using atomic density grids to treat classification and regression of molecular structures as analogous to three-dimensional image recognition. Other work has trained deep neural networks for binding affinity prediction [gomes2017, jimenez2018kdeep] and quantum energy estimation [schutt2017] using atomic coordinate-based input representations.
Deep learning for generative modeling generally falls into two frameworks. In a variational autoencoder (VAE) [kingma2014vae], an encoder and decoder network are trained to map between the input space and a set of latent random variables that follow a predetermined distribution. This allows the model to sample from the prior and create novel data samples. Generative adversarial networks (GANs) [goodfellow2014gan] are a complementary approach [larsen2016vaegan] that concurrently trains a discriminative network to distinguish real and fake data and a generative network to create samples that the discriminator classifies as real. GANs have proven successful at learning to generate novel image-like data [radford2016dcgan].
The first efforts applying deep generative models to molecule generation [gomez-bombarelli2016, segler2017lm] used the SMILES string format [weininger1988smiles]. Treating chemicals as linguistic sequences allows the use of recurrent networks that have been effective at modeling natural language. However, similar molecules can have very different SMILES strings due to lack of permutation invariance, and invalid SMILES strings can be produced unless grammatical constraints are imposed on the model [kusner2017gramvae, dai2018sdvae] or valid strings are rewarded through reinforcement learning [guimaraes2018organ]. Lastly, SMILES strings only describe molecules in terms of their connectivity and cannot represent their 3D spatial conformations.
Others have trained deep generative models using molecular graphs [simonovsky2018graphvae, decao2018molgan, jin2019jtvae, kwon2019graph] by leveraging graph convolutions [kearnes2016graphconv]. These more naturally capture many invariances present in chemical structures, but require expensive or approximate graph matching algorithms to compare output to input molecules. Molecular graphs also suffer from the possibility of invalid outputs, so reinforcement learning has been used with them as well [decao2018molgan, kwon2019graph]. Most work with molecular graphs has been confined to two dimensions, though forays have been made into generating them along with distance matrices [gebauer2018distmat, hoffmann2019distmat]. Though highly promising, molecular graph-based models have so far only been evaluated at generating different constitutional isomers of a single, fixed set of atom types.
Atomic density grids have several advantages over other molecular representations, as well as some limitations. Grids can process as many atoms as can fit within their bounds in constant time and memory, while SMILES strings and molecular graphs scale linearly and quadratically with the number of atoms. Grids naturally facilitate learning from 3D spatial conformations using convolutions [ragoza2017cnn]. They are also permutation invariant, though not rotation invariant. Their main limitation is that they must be converted into a discrete representation to generate valid molecules. Previous efforts at generating atomic densities overcame this challenge by using a recurrent captioning network to convert them to SMILES strings [skalic2019shapevae]. However, this relinquishes the 3D nature of the generative model. Rather than construct 2D molecules from 3D atomic densities, we instead solve for discrete 3D molecules that best fit atomic densities through optimization.
3.1 Representing molecules as atomic density grids
We represent molecules in a grid format amenable to convolutional network training that was first described in [ragoza2017cnn]. To convert molecules to grids, we assign each atom a type based on its element, aromaticity, hydrophobicity, and hydrogen donor/acceptor state. Then, the atoms are represented as continuous, Gaussian-like densities on a 3D grid with separate channels for each atom type, akin to an image with different channels for each color. We compute the values in each atom type channel by summing over the density of each atom with the corresponding type at each grid point. Grids span a cube of side length 23.5 Å at 0.5 Å resolution, resulting in 48×48×48 points per channel. We center them on the input molecule and compensate for the lack of rotation invariance by randomly rotating molecules during training. We also evaluate 10 rotations of each molecule in the validation and test sets. This is facilitated by computing grids on-the-fly using libmolgrid, an open-source, GPU-accelerated molecular gridding library [sunseri2020libmolgrid].
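As an illustration, the following minimal sketch accumulates per-channel density values over a one-dimensional row of grid points. The kernel form (a Gaussian that transitions to a quadratic tail reaching zero at 1.5 times the atomic radius) and the radii are assumptions for illustration, not necessarily libmolgrid's exact implementation:

```python
import math

# Gaussian-like atomic density: Gaussian out to the atomic radius r, then a
# quadratic tail that decays smoothly to zero at 1.5 * r. (This kernel form
# is an assumption for illustration, not the library's exact kernel.)
def atom_density(d, r):
    if d < r:
        return math.exp(-2.0 * d * d / (r * r))
    if d < 1.5 * r:
        e2 = math.e ** 2
        return (4.0 / (e2 * r * r)) * d * d - (12.0 / (e2 * r)) * d + 9.0 / e2
    return 0.0

# Sum per-type densities over a row of grid points; a real grid sums over
# all 48^3 points in three dimensions, one channel per atom type.
def grid_row(atoms, points, n_types):
    # atoms: list of (type_index, 1D coordinate, radius) tuples
    values = [[0.0] * len(points) for _ in range(n_types)]
    for t, x, r in atoms:
        for j, p in enumerate(points):
            values[t][j] += atom_density(abs(p - x), r)
    return values
```

Note that this kernel is continuous at the radius boundary, which is what makes grid values differentiable with respect to atom coordinates.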
3.2 Constructing molecules from atomic density grids
The inverse problem of converting density grids back into discrete molecular structures does not have an analytic solution, and instead must be solved through optimization. libmolgrid allows us to compute the grid representation of a partial atomic structure and backpropagate a gradient from grid values to atom coordinates. We can also estimate the best locations to place atoms on a density grid by selecting from the grid points with the largest density values. Putting this together, we devised an iterative solver algorithm that combines beam search (in which only the current top-k structures are stored and expanded) with gradient descent on atomic coordinates, finding a set of atoms that best fits a reference density.
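The structure of this solver can be illustrated with a toy one-dimensional sketch. The kernel, grid, and beam width here are illustrative, and the gradient descent refinement of coordinates is omitted; candidate placements come from the grid points with the largest residual density, and the beam keeps the partial structures with the lowest L2 loss:

```python
import math

# Toy density kernel and renderer for a 1D "grid" (illustrative only).
def kernel(d, r=1.0):
    return math.exp(-2.0 * d * d / (r * r)) if d < 1.5 * r else 0.0

def render(coords, points):
    return [sum(kernel(abs(p - x)) for x in coords) for p in points]

def l2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def fit_atoms(ref, points, n_atoms, beam_size=3):
    # beam of (coordinates, loss) pairs; only the top-k are expanded
    beam = [([], l2(render([], points), ref))]
    for _ in range(n_atoms):
        candidates = []
        for coords, _ in beam:
            # propose placements at points with the largest residual density
            residual = [r - g for r, g in zip(ref, render(coords, points))]
            best = sorted(range(len(points)), key=lambda i: -residual[i])
            for i in best[:beam_size]:
                new = coords + [points[i]]
                candidates.append((new, l2(render(new, points), ref)))
        beam = sorted(candidates, key=lambda c: c[1])[:beam_size]
    return beam[0]
```

In this simplified setting, fitting a density rendered from two atoms recovers their positions exactly; the real solver additionally refines coordinates by gradient descent between expansions.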
This algorithm returns atom types and coordinates, but further processing is required to produce a valid molecule. We accomplish this by adding bond information based on the returned atom types and geometry. We customized functions from OpenBabel [oboyle2011openbabel, openbabel] to infer connectivity and bond order while respecting the constraints of our atom typing scheme. Depending on the atom type, these constraints include aromatic ring membership, presence of bonds to hydrogen and heteroatoms, and formal charge. The validity of the resulting molecules is assessed using criteria from RDKit: a molecule is valid if and only if it is composed of a single, connected fragment and raises no errors when passed to the RDKit function SanitizeMol [rdkit].
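The simplest constraint involved can be sketched as a check that each atom's total bond order does not exceed its maximum valence. The table and function below are hypothetical simplifications for illustration; the actual routine also accounts for formal charge, aromaticity, and our atom typing scheme:

```python
# Common neutral organic-subset maximum valences (illustrative; real bond
# adding must also handle formal charges and hypervalent states).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "S": 6, "Cl": 1}

def valences_ok(elements, bonds):
    # elements: list of element symbols; bonds: list of (i, j, order) tuples
    total = [0] * len(elements)
    for i, j, order in bonds:
        total[i] += order
        total[j] += order
    return all(total[i] <= MAX_VALENCE[e] for i, e in enumerate(elements))
```

For example, a carbonyl carbon double-bonded to oxygen passes this check, while a carbon bonded to five hydrogens fails it, mirroring the valence errors reported by SanitizeMol.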
3.3 Generative model architectures and training
We trained deep generative models using atomic density grids as both input and output. Generators were trained as convolutional autoencoders to reconstruct their input by minimizing the squared error, or L2 loss, of the output density with respect to the input density. In addition, we trained our models as GANs, using an adversarial discriminative network to encourage the generated densities to match the overall distribution of real densities. Both standard (AE) and variational (VAE) autoencoders were trained as the generator component. Their architectures were identical except that the VAE had a probabilistic latent space that was trained to follow a standard normal distribution by a Kullback-Leibler (KL) divergence loss function. The VAE combined its KL divergence, L2, and adversarial loss functions with weights of 0.1, 1.0, and 10.0 respectively, while the AE weighted its L2 and GAN losses equally at 1.0. Models were trained using the Adam optimization algorithm [kingma2019adam] with a fixed learning rate policy for 100,000 iterations, using a batch size of 16. The full details of our model architectures are found in Figure 7.
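The combined generator objectives reduce to weighted sums of the individual loss terms, as sketched below; the function names are illustrative and the arguments stand in for the scalar losses computed from network outputs:

```python
# VAE generator objective: weighted KL divergence, reconstruction (L2),
# and adversarial terms, using the weights stated in the text.
def vae_generator_loss(kl_div, l2, adversarial,
                       w_kl=0.1, w_l2=1.0, w_adv=10.0):
    return w_kl * kl_div + w_l2 * l2 + w_adv * adversarial

# AE generator objective: L2 and adversarial terms weighted equally.
def ae_generator_loss(l2, adversarial, w_l2=1.0, w_adv=1.0):
    return w_l2 * l2 + w_adv * adversarial
```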
3.4 Data sets
In order to train deep neural networks to generate molecules in 3D, we downloaded the full set of commercially available stock compounds from the MolPort database [molport] and filtered out any that failed our atom typing scheme, yielding 7,559,852 distinct small molecules. Up to 20 conformers were generated for each compound using RDKit, with an average of 14 conformations per compound. Out of the 108,878,042 resulting molecular structures, we randomly held out 10% in a validation set for model optimization, while the remaining examples were used for training.
Models were evaluated on an independent test set of compounds collected from PubChem [pubchem2019] with varying degrees of similarity to the training set. Any molecules that had more than one disconnected fragment, had molecular weight > 1,000, or had elements not found in our atom typing scheme were excluded. The remaining compounds were then sorted into bins based on their maximum similarity to any molecule in the training set, using the Tanimoto coefficient between OpenBabel FP2 fingerprints. One thousand molecules were retained in each similarity bin of width 0.1 from 0.1 to 1.0, and 14 compounds were found with zero similarity to any training set molecule. A single conformer was generated for each test set molecule using RDKit.
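The Tanimoto coefficient used for this binning is a set overlap over fingerprint bits. A minimal sketch, operating on sets of "on" bit indices rather than OpenBabel's FP2 fingerprint objects:

```python
# Tanimoto coefficient between two fingerprints, represented here as sets of
# "on" bit indices (a simplification of OpenBabel FP2 bit vectors).
def tanimoto(bits_a, bits_b):
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)
```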
3.5 Evaluation procedure
Our general approach consisted of a pipeline through successive molecular representations, as depicted in Figure 1. The main pathway through the pipeline involved first typing and gridding real molecules to produce real densities. Next these were encoded into latent vectors using either the standard or variational autoencoder, then decoded as generated densities (AE posterior, VAE posterior). Densities were also generated by decoding latent vectors from a standard normal distribution with the variational autoencoder (VAE prior). Atom fitting was applied to generated densities to produce atom types and coordinates, followed by bond adding to yield valid output molecules.
Alternate pathways through our pipeline served as baselines that allowed us to assess the performance of each stage in isolation. We fitted atom types and coordinates to real density grids in order to evaluate atom fitting independently from the generative models (Real density). This also established an upper bound on our expected performance in atom-level metrics. We did the same for evaluating bond adding and molecule-level metrics by adding bonds to real atom types (Real atom types).
We employed a variety of metrics based on the different representations to evaluate our methods. We used L2 loss to directly compare generated densities to real densities and assess convergence while training our neural networks. After fitting atoms to real or generated densities, we considered atom type count differences and RMSD with respect to the real atom types and coordinates. Once molecules were constructed by adding bonds to atomic structures, we computed their validity, molecular weight, quantitative estimate of drug-likeness score (or QED score, in which higher is more drug-like) [bickerton2012qed], synthetic accessibility score (or SA score, in which lower is more synthesizable) [ertl2009sas] , and Tanimoto similarity to the real input molecule using OpenBabel FP2 fingerprints.
During training, we evaluated each method on 1,000 random molecules from the validation set every 10,000 iterations. We generated output molecules for 10 different random rotations, and in the case of the VAE, different latent samples, of each input molecule. For the VAE prior, which was not conditioned on an input molecule, this simply meant generating 1,000 batches of 10 molecules from different latent samples. Once training converged, we evaluated each method on the different test set similarity bins. Again, we generated output molecules for 10 different random rotations and/or latent samples of each input molecule, or an equivalent number of batches from the VAE prior.
We tested our ability to recover input types and coordinates from real densities and those generated by the AE and VAE posteriors using atom fitting. After that we measured the overall validity of the molecules made by assigning bonds to the atomic structures from each method. We then evaluated the distributions of properties that are relevant to drug development on the valid generated molecules. Finally, we visualized some examples of sampling and interpolating between 3D molecules in the latent spaces learned by our generative models.
4.1 Training convergence
Figure 2 shows the training dynamics of our generative models on the validation set. The improving quality of generated densities as measured by L2 loss also resulted in better atom fitting. The number of atom type differences decreased relative to the input structure in both the AE and VAE. By the end of training, the AE matched the atom types exactly with 68% probability, and did so in at least one sample out of 10 for 98% of validation molecules. The VAE recovered the exact atom types 15% of the time overall, and did so in at least one sample for 61% of the validation set.
In all generative modeling methods, the number of fit atoms increased with training. The AE posterior converged to the average size of molecules in the true data distribution. The VAE posterior tended to produce slightly smaller molecules than the AE while the prior molecules were significantly smaller.
The validity shown in Figure 2 uses unmodified versions of the OpenBabel ConnectTheDots and PerceiveBondOrders functions on the atom types and coordinates instead of our full bond adding routine. These do not connect distant fragments with long bonds or use atom types to constrain bond adding, so this metric can be considered a simple proxy for the geometric validity of the coordinates produced by atom fitting. Both the AE and VAE posterior tended to produce increasingly valid coordinates over time, while the prior molecules had low geometric validity for most of training.
The AE and VAE posterior generated increasingly similar molecules to the input as training progressed. After 100,000 iterations, the AE produced molecules with a Tanimoto similarity of 0.61 on average, and in a batch of 10 samples the most similar molecule had an expected similarity of 0.88. Around 54% of the validation set molecules were perfectly reconstructed in at least one sample from the AE. The VAE posterior molecules had an average similarity of 0.32 to their input molecule and less than 10% yielded the same exact molecule in any sample.
The QED score of valid molecules created by the AE approached the mean of the validation set by iteration 100,000. The drug-likeness of the VAE posterior increased during training as well. For both the AE and VAE posterior, the most drug-like output sample had a higher mean QED score than the input molecule. The QED scores of prior molecules converged lower than the posterior molecules, but the maximum QED score per batch was in the range of the posterior molecules.
4.2 Atom fitting reconstruction
Our capacity to reconstruct atom types and coordinates of test set molecules from real and generated densities is shown in Figure 3. The mean L2 loss of AE and VAE posterior densities on the test set was around 14 and 27 respectively. There was little correlation with the test set similarity bin, suggesting that the neural networks were not overfit to the training set.
Fitting atoms to real densities produced an average type difference of 0.11 compared to the real atom types, and the exact input types were reconstructed with 98% probability. The number of type differences from fitting to generated densities was 1.9 for the AE and 4.2 for the VAE posterior on the test set, which translated into recovering the exact types from 47% of AE densities and 10% of VAE posterior densities. The exact input types were recovered in at least one sample for 70% of test set molecules using the AE and 34% with the VAE posterior. Atom type differences increased on generated densities when input molecules were highly dissimilar to the training set.
Whenever the input atom types were recovered exactly by atom fitting, we computed an RMSD to the real atoms by finding the minimum-distance assignment between atoms of the same type. The average RMSD was 0.011 for atoms fit to real densities, 0.28 when fitting to AE densities, and 0.47 when fitting to VAE posterior densities. This indicates nearly identical coordinates in every case, with slightly more variation from the VAE.
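The minimum-distance assignment described above can be sketched by brute force over permutations, which is tractable for the small per-type atom counts involved. This sketch assumes all atoms share one type for simplicity; a production implementation would partition atoms by type and use an assignment solver such as the Hungarian algorithm:

```python
import itertools
import math

# RMSD between two equal-length lists of (x, y, z) coordinates
# under a fixed atom correspondence.
def rmsd(coords_a, coords_b):
    n = len(coords_a)
    return math.sqrt(sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)) / n)

# Minimum RMSD over all assignments of fitted atoms to real atoms
# (brute force; assumes all atoms have the same type).
def min_rmsd(coords_a, coords_b):
    return min(rmsd(coords_a, [coords_b[i] for i in perm])
               for perm in itertools.permutations(range(len(coords_b))))
```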
4.3 Molecular validity
Once atom fitting was validated, we assessed the ability to produce valid molecules by adding bonds to atom types and coordinates from each method, using the test set as input. We were able to produce valid molecules from over 99% of real atom types and atoms fit to real densities. Using our generative models, we created valid molecules from 92% of AE densities, 91% of VAE posterior densities, and 91% of VAE prior densities. At least one valid sample was generated for 98% of test set molecules using the AE, 99% using the VAE posterior, and for 100% of prior batches. The most common reason for invalid molecules was an atom having a greater valence than allowed, followed by failing to assign discrete bond orders to ring systems marked aromatic.
4.4 Properties of valid molecules
Figure 4 shows drug property distributions of valid molecules from our generative models on the test set compared with baseline methods. In addition to using the real test set molecules as a baseline, we included adding bonds to real atom types and coordinates. The output molecules from real atom types had mean Tanimoto similarity greater than 0.82, and a median of 1.0, to the input molecule. The similarity to the input declined as the input molecules became less similar to the training set. Despite this, the distributions of molecular weight, QED score, and synthetic accessibility of molecules from adding bonds to real atoms all closely matched those of the real test set molecules.
The expected Tanimoto similarity of AE posterior molecules to their input was 0.43, and the most similar output sample was around 0.63. VAE posterior molecules had a lower expected similarity of 0.24, with the most similar sample expected at 0.43. The similarity of VAE prior molecules to random test set molecules was only 0.13. The output molecules from every generative method decreased in similarity to the input as it became less similar to the training set, following the same trend seen in bond adding to real atoms.
The molecules generated from the AE and VAE posterior were similar in size to real molecules by molecular weight, though slightly smaller on average. VAE prior molecules were significantly smaller than either VAE posterior or real molecules. When the input molecule had less than 0.3 similarity to the training set, the size of the output molecules from the AE and VAE posterior fell to around that of the prior distribution. On all other test set bins, the molecular weight of posterior molecules was highly correlated with the size of the input.
The mean QED score of AE posterior samples was 0.49, which is near the drug-likeness of real molecules in our test set. The VAE posterior and prior molecules had slightly lower expected QED scores of around 0.43 and 0.38. The most drug-like output molecule in ten samples had a higher expected QED than the input molecule, using either the AE or VAE posterior. This trend held across every test set bin except molecules with 0.0 similarity to the training set. The SA scores of molecules from the generative models tended to be slightly higher than those of real test set molecules, indicating less synthesizable outputs. However, they were more synthetically accessible than the molecules in the highly dissimilar test bins.
4.5 Latent sampling and interpolation
Finally, we exemplify two potential use cases of the continuous latent spaces in which our generative models learned to represent valid molecules. Figure 5 shows examples of molecular structures produced by drawing 10 random samples from either the VAE posterior or prior and selecting the sample with the highest QED score. Each molecule is shown in 2D and 3D, with the 3D structure minimized using UFF [rappe1992uff]. The unminimized 3D conformation produced by the generative model is overlaid with transparency. 80% of the VAE posterior examples had a higher QED score than their input. The extent to which the VAE posterior outputs were modified relative to the input varied widely: some examples changed only a single bond or functional group, while others reconfigured the entire molecule. There was a clear tendency to alter rings and scaffolds in unusual ways, creating structures that bear a high degree of shape similarity but little topological resemblance. On the other hand, the VAE prior molecules were small, fragment-like structures with a high density of polar functional groups but not much scaffold diversity.
In Figure 6, four different interpolations were performed between different molecules or different poses of the same molecules through the latent spaces learned by our generative models. The AE interpolations were performed linearly, while the VAE interpolations used spherical linear interpolation. There is some degree of chemical structure maintained throughout the interpolations. For instance, the sulfonyl and amide groups are present in almost every step of the AE pose transition. Most of the transformations maintain overall shape continuity rather than topological or chemical similarity. This is evidenced by aromatic densities that smoothly change size and position translating into aromatic rings that become smaller, non-aromatic rings or alkane branches.
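Spherical linear interpolation between two latent vectors can be sketched as follows. This is a standard slerp, assumed here for illustration rather than taken from the model code; it falls back to linear interpolation when the vectors are nearly parallel:

```python
import math

# Spherical linear interpolation between latent vectors p0 and p1 at t in [0, 1].
def slerp(p0, p1, t):
    dot = sum(a * b for a, b in zip(p0, p1))
    norm0 = math.sqrt(sum(a * a for a in p0))
    norm1 = math.sqrt(sum(b * b for b in p1))
    omega = math.acos(max(-1.0, min(1.0, dot / (norm0 * norm1))))
    if omega < 1e-8:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(p0, p1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(p0, p1)]
```

Unlike linear interpolation, slerp preserves the norm of the latent vector along the path, which keeps intermediate samples in a high-density region of a standard normal prior.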
By training deep generative models on atom density grids and applying our atom fitting and bond adding algorithms, we have developed the first deep learning model for generating diverse three-dimensional molecular structures. Our models generate valid molecules at a rate greater than 90%, but we realized that validity may be an insufficient metric for assessing generative models of 3D molecules since it ignores geometry and energetic favorability. Molecules that were made valid by adding strained bonds to ensure a single connected fragment often had high energy. This highlights the need for different evaluation criteria for generating 3D molecules than those used for 2D molecules.
Our standard AE was successful at exactly reconstructing its input atom types and coordinates from about half of test set densities, but its performance fell off on molecules that were increasingly dissimilar to the training set, as seen in Figure 3. This was not attributable to atom fitting, since we were able to reconstruct the exact atom types from 98% of real densities regardless of the test set bin and attain extremely low RMSD. Instead, the disparity in fitting to generated densities was reflected in the L2 loss of those densities, implying that future work should focus on improving the quality of the underlying generative models. It could also be a sign of overfitting and the need for a stricter validation set, since this was not detected in Figure 2.
The VAE posterior was less likely to reconstruct the input, producing outputs that differed by about 4 atom types on average. This was expected as the VAE should, by design, produce variations of the input, and could be used to propose new drug-like molecular structures related to a given molecule. This was supported by the fact that the maximum QED score of the VAE posterior samples were higher than the QED of the input, as seen in Figure 4. Furthermore, there was no change in the synthetic accessibility of the VAE posterior molecules as the input became less similar to the training set. These observations suggest that the variational posterior had the capability of reliably proposing novel drug-like compounds based on input molecules.
The VAE prior molecules had slightly lower drug-likeness and higher synthetic accessibility than the posterior, in part due to their lower molecular weight. A primary objective in variational models is gaining the ability to sample useful outputs from the prior, and in their current state our VAE prior molecules are too small and structurally simple. Moving forward, an increased emphasis will be placed on transferring the relative success of posterior sampling to the prior, for instance by increasing the KL divergence loss weight over time.
Given that we generated molecules with a Tanimoto similarity of only 0.82 to the real molecule from real atom types and coordinates, there is room for improvement in bond adding. This is likely due to limitations in our atom typing scheme, which lacks the information (e.g. formal charges, hybridization states) necessary to perfectly reconstruct certain molecules. An alternative atom typing scheme that provides better information for bond adding could improve our molecular reconstruction accuracy and the diversity of our novel generated molecules.
This work serves as a proof of concept for generating 3D molecules using deep learning that can be expanded in numerous ways. The ability to map molecular structures to and from a continuous representation introduces the possibility of applying continuous optimization to the generation of desirable molecular structures. While 2D generative models have utilized cheminformatic property optimization, we are currently integrating protein structures into our pipeline to create 3D structure-conditional generative models. These are tasked with learning distributions over bound ligand structures conditioned on a binding pocket. The potential for new directions in 3D generative models for drug discovery is vast, and we hope that their exploration can be accelerated by making this project publicly available at https://github.com/mattragoza/liGAN and https://github.com/gnina.
6 Supplementary Materials
6.1 Model architecture details
6.2 Atomic density and gridding functions
In the molecular grid format, atoms are represented as continuous, Gaussian-like densities on a three-dimensional grid with separate channels for each atom type, analogous to an RGB image. The density of an atom at a grid point is a function of the displacement $d$ between the atom center and the grid point and the atomic radius $r$:

$$f(d, r) = \begin{cases} \exp\left(-\dfrac{2d^2}{r^2}\right) & 0 \le d < r \\[2pt] \dfrac{4}{e^2 r^2} d^2 - \dfrac{12}{e^2 r} d + \dfrac{9}{e^2} & r \le d < 1.5r \\[2pt] 0 & d \ge 1.5r \end{cases}$$
Given a 3D molecular structure with $N$ atoms and $T$ possible atom types, represented as a vector of atom type indices $t \in \{1, \dots, T\}^N$ and a matrix of atomic coordinates $X \in \mathbb{R}^{N \times 3}$, the grid values in each atom type channel are computed by summing the density of each atom with the corresponding type at each grid point:

$$G_c(p) = \sum_{i \,:\, t_i = c} f(\|p - X_i\|, r_c)$$

where $p$ is a grid point and $r_c$ is the atomic radius associated with type $c$.