Generating novel drug molecules is a daunting task: the goal is to create new molecules (or optimize known ones) with multiple desirable properties that often compete and interact tightly with each other. For example, an optimal drug molecule should be easy to synthesize, bind with high affinity to a specific target protein of interest while binding weakly to other targets, and exhibit high drug-likeness (QED). This makes drug discovery a costly (2-3 billion USD) and time-consuming process (more than a decade) with a low success rate (less than 10%).
Traditional in silico molecule design and screening rely on considerable domain knowledge, physics-based simulations, and heuristic search algorithms. However, optimizing over the discrete, unstructured, and sparse molecular space remains an intrinsically difficult challenge. As a result, there is strong interest in developing automated machine learning techniques that can efficiently discover sizeable numbers of plausible, diverse, and novel candidate molecules in the vast space of possible molecules [polishchuk2013estimation].
There has been a recent surge in machine learning based models for candidate drug screening and novel molecule design. The Simplified Molecular Input Line Entry System (SMILES), a text string representation of molecules, has been a popular choice for representing molecules in these models. Whereas earlier approaches to molecule generation relied on recurrent neural networks (RNNs) [segler2017generating, gupta2018generative], recent methods employ generative frameworks such as the Variational Autoencoder (VAE) [gomez2018automatic, blaschke2018application, kang2018conditional, lim2018molecular] and the Generative Adversarial Network (GAN) [guimaraes2017objective].
A particular challenge of SMILES-based generative approaches is ensuring that the generated strings are syntactically valid under the SMILES grammar; [kusner2017grammar] and [dai2018syntax] added grammar constraints to SMILES strings to improve the chemical validity of the generated molecules. There has also been increasing interest in molecular graph-based generative methods [li2018learning, li2018multi, simonovsky2018graphvae, samanta2019nevae, ma2018constrained, kajino2018molecular]. Unfortunately, graph-based approaches suffer from several shortcomings: the models are not invariant to permutations of their node labels, training has quadratic complexity in the number of nodes, and generating semantically valid graphs remains challenging. [jin2018junction], which represents a molecular graph as fragments (such as rings and atom branches) connected in a tree structure, is considered a state-of-the-art architecture in this context. Among GAN-based approaches, [de2018molgan] introduced MolGAN, a GAN trained with reinforcement learning (RL) to generate molecular graphs.
Reinforcement learning (RL) or Bayesian optimization (BO) has often been employed on top of a SMILES- or graph-based molecule generator to ensure valid molecule generation and/or to generate molecules with a desired structure or property [popova2018deep, olivecrona2017molecular, jaques2017sequence, putin2018reinforced, Zhavoronkov2019natbio, gomez2018automatic]. Both BO [gomez2018automatic] and RL [zhou2019optimization, Zhavoronkov2019natbio] incur high computational cost, as they require a large number of property evaluations. Semi-supervised learning has also been used for conditional generation [lim2018molecular, kang2018conditional, li2018multi], but it requires labels to be available during generative model training.
To generate drugs that are specific to a particular target, the generative models in existing works [Zhavoronkov2019natbio, li2018multi] are typically fine-tuned on the subset of molecules that bind to that target. This approach requires separate fine-tuning for every target of interest and limits the exploratory capability of the model. Accounting for off-target selectivity in a broader context also becomes non-trivial in this setting.
We propose an alternative method, Controlled Generation of Molecules (CogMol), that first learns a model capturing the vast molecular space using a variational autoencoder. We then capture known relationships between protein sequences and existing drug molecules by training a binding affinity predictor post hoc on the learned latent features of drug molecules and on protein sequence embeddings. The resulting model can generate candidate molecules for many target protein sequences, rather than requiring retraining for every individual target. This approach also allows us to perform controlled generation that explicitly takes off-target selectivity into account. Additional controls, such as drug-likeness, can be added to further refine the controlled generation process.
Designed molecules are further screened using a multitask deep learning classifier trained to predict toxicity on several in vitro and clinical endpoints. We hope that accounting for important factors such as selectivity and toxicity within the AI framework will make the in silico drug design process faster, more agile, and less costly, leading to shorter discovery pipelines with a higher success rate.
Given the urgency of the ongoing COVID-19 pandemic, we apply CogMol to generate candidate molecules that bind to one of three relevant target proteins of the SARS-CoV-2 virus, namely NSP9 Replicase, Main Protease, and the Receptor-Binding Domain (RBD) of the SARS-CoV-2 S protein, and act as their inhibitors (see Figure 1). The generated molecules may serve as a blueprint for creating drugs that bind to the viral protein with high target affinity and high off-target selectivity, as well as high drug-likeness.
2 Molecule Generation Pipeline
As depicted in Figure 2, our target-specific drug design pipeline includes the following components:
A Variational Autoencoder (VAE), first trained unsupervised and then jointly with attribute regressors (QED and Synthetic Accessibility, SA), that learns a disentangled latent space of the molecules on the SMILES representation.
A set of attribute regression models (predicting QED, off-target selectivity, and target affinity) trained on the VAE latent embeddings.
An attribute-conditioned sampling scheme to generate molecules with desired attributes from the latent space.
An in silico screening of generated molecules using sequence-level toxicity and affinity predictors.
The first three steps constitute CogMol and the last step empowers it with in silico screening.
We used the Moses benchmarking dataset [polykovskiy2018molecular] for unsupervised VAE training, which includes 250k/10k/10k molecules (training/test/scaffold test sets) from the ZINC database [irwin2005zinc]. These molecules were filtered such that the molecular weight is in the range of 250 to 350 Daltons, the number of rotatable bonds is at most 7, and logP is at most 3.5. Molecules containing charged atoms, atoms other than C, N, S, O, F, Cl, Br, or H, or cycles longer than 8 atoms were removed. The molecules were further filtered with medicinal chemistry filters (MCFs) and PAINS filters.
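The numeric filtering criteria above can be expressed as a simple predicate. This is only a sketch: the property values are assumed to be precomputed (e.g. with RDKit), the `MoleculeProps` container is hypothetical, and the MCF/PAINS substructure filters are omitted.

```python
from dataclasses import dataclass

ALLOWED_ATOMS = {"C", "N", "S", "O", "F", "Cl", "Br", "H"}

@dataclass
class MoleculeProps:
    """Hypothetical container of precomputed molecular properties."""
    mol_weight: float        # in Daltons
    n_rotatable_bonds: int
    logp: float
    atoms: set               # element symbols present in the molecule
    is_charged: bool
    max_ring_size: int

def passes_filters(m: MoleculeProps) -> bool:
    """True if the molecule meets the dataset inclusion criteria listed above."""
    return (
        250.0 <= m.mol_weight <= 350.0
        and m.n_rotatable_bonds <= 7
        and m.logp <= 3.5
        and not m.is_charged
        and m.atoms <= ALLOWED_ATOMS      # subset check: no disallowed elements
        and m.max_ring_size <= 8
    )
```

In a real pipeline the same predicate would be applied after parsing each SMILES string, before the MCF/PAINS screens.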
For target-specific compound design, we used curated IC50-labeled compound-protein binding data from BindingDB [gilson2015bindingdb], as reported in DeepAffinity [karimi2018deepaffinity]. Since our objective is to build the best possible binding affinity (pIC50) regression model from the available data, we also added the four protein classes excluded in that work back into our training data.
3.2 Variational Autoencoder for Molecule Generation
A Variational Autoencoder (VAE) [kingma2013auto] frames an autoencoder in a probabilistic formalism that constrains the expressivity of the latent space $z$. Each sample $x$ defines an encoding distribution $q_\phi(z|x)$, and for each sample this encoder distribution is constrained to be close to a simple prior distribution $p(z) = \mathcal{N}(0, I)$. We consider only the case of an encoder specifying a diagonal Gaussian distribution, i.e. $q_\phi(z|x) = \mathcal{N}\big(z; \mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))\big)$. The encoder neural network produces the means $\mu_\phi(x)$ and the log variances $\log \sigma^2_\phi(x)$. The marginal posterior is $q_\phi(z) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[q_\phi(z|x)\right]$. The standard VAE objective is defined as follows (where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence):
$$\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big).$$
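For concreteness, the reparameterized sampling and the closed-form KL term of this objective for a diagonal Gaussian encoder can be written as follows (a NumPy sketch, not the paper's implementation):

```python
import numpy as np

def reparameterize(mu, logvar, rng=np.random.default_rng(0)):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick),
    keeping the sample differentiable with respect to mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_diag_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
```

In a full VAE, the KL term is subtracted from the expected reconstruction log-likelihood evaluated at the reparameterized sample.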
We used a bidirectional Gated Recurrent Unit (GRU) network with a linear output layer as the encoder. The decoder is a 3-layer GRU RNN with 512 hidden units and intermediate dropout layers with dropout probability 0.2. We then jointly trained two property predictors with the VAE, one for QED and one for SA, each parameterized by a feed-forward network, to predict the property value $y$ from the latent embedding $z$ of a molecule $x$. Finally, we trained the VAE with the QED and SA predictors on BindingDB small molecules. We report the performance of the final VAE model in Table 1.
3.3 Latent Variable-based Predictors
To test the information content of the VAE latent space, we trained multiple attribute (QED, logP, and SA) predictors on the latent embeddings. These regression models have 4 hidden layers with 50 units each and ReLU nonlinearity. Their performance is reported in Table 2, showing low root-mean-square error (RMSE) on test data for all three attributes.
We also trained a binding affinity regression model using the IC50 data from BindingDB. This model takes a representation of a target protein sequence and the latent embedding of a molecule as input and predicts the binding affinity of the protein-molecule pair. We used pretrained protein embeddings from [bepler2018learning] to initialize the protein weights. The protein embedding and the molecular embedding are concatenated and passed through a single hidden layer with 2048 units and ReLU nonlinearity. The best model, with a test RMSE of 1.2310, was selected via extensive hyperparameter tuning (Table 2). We also trained a binding affinity predictor using the SMILES string ($x$) instead of the latent embedding ($z$) as the molecular representation, which yields an RMSE of 0.8426, comparable with recent work [karimi2018deepaffinity].
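A minimal sketch of this regressor's forward pass follows. The weights here are randomly initialized stand-ins for the trained model, and the embedding dimensions are illustrative, not the paper's actual sizes (only the 2048-unit hidden layer is taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
D_PROT, D_MOL, D_HID = 100, 128, 2048   # D_PROT and D_MOL are illustrative

# Random weights stand in for the trained affinity regressor.
W1 = rng.standard_normal((D_PROT + D_MOL, D_HID)) * 0.01
b1 = np.zeros(D_HID)
W2 = rng.standard_normal((D_HID, 1)) * 0.01
b2 = np.zeros(1)

def predict_affinity(protein_emb, mol_latent):
    """Concatenate protein and molecule embeddings, then regress pIC50
    through a single ReLU hidden layer."""
    joint = np.concatenate([protein_emb, mol_latent], axis=-1)
    h = np.maximum(joint @ W1 + b1, 0.0)
    return (h @ W2 + b2).squeeze(-1)
```

Because the predictor operates on latent embeddings plus a protein representation, the same trained model scores candidate molecules against arbitrary target sequences without retraining.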
| Model | Metric | Test performance |
|---|---|---|
| Binding Affinity ($z$) | RMSE | 1.2310 |
| Binding Affinity ($x$) | RMSE | 0.8426 |
| Toxicity Prediction | ROC AUC | see Table 5 |
3.4 Controlled Generation
Our objective is to generate molecules that simultaneously satisfy multiple (and often conflicting) objectives. Specifically, we want the generated molecules to have high binding affinity to the selected SARS-CoV-2 target, high drug-likeness, and high off-target selectivity.
Selectivity to a particular target is often modeled only in the later stages of a drug development pipeline and we believe that incorporating selectivity during the candidate generation stage will contribute to a reduction in the failure rate of drug candidates. We define selectivity as the excess binding affinity of a molecule to a target of interest over its average binding affinities to a random selection of targets [bosc2017use].
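This definition translates directly into code (a trivial sketch; in practice the affinities would come from the latent-space binding affinity regressor described above):

```python
import numpy as np

def selectivity(affinity_to_target, affinities_to_random_targets):
    """Excess binding affinity of a molecule to the target of interest
    over its average affinity to a random panel of other targets."""
    return affinity_to_target - np.mean(affinities_to_random_targets)
```

A positive value indicates the molecule binds the intended target more strongly than typical off-targets; a value near zero indicates a promiscuous binder.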
We performed conditional sampling using Conditional Latent (attribute) Space Sampling (CLaSS), the method described in [das2020science]. In short, CLaSS leverages attribute classifiers trained post hoc on the latent space and uses a rejection sampling scheme to generate samples with the desired attributes. We added target binding affinity, off-target selectivity, and drug-likeness as controls in generating candidate molecules. The distributions of CogMol-generated molecules in terms of binding affinity and off-target selectivity are shown in Figure 3. The results indicate that generating high-affinity ligands is more challenging for NSP9 (Figure 3a), while Main Protease ligands are more selective (Figure 3b).
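The rejection-sampling step of CLaSS can be illustrated as follows. This is a toy sketch: the three classifiers are hypothetical sigmoid stand-ins (the real ones are trained post hoc on the VAE latent space), the acceptance threshold is an assumption, and for simplicity we sample from a standard normal prior rather than a density model of the latent marginal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attribute classifiers acting on a latent vector z.
def p_high_affinity(z):  return 1.0 / (1.0 + np.exp(-z[..., 0]))
def p_selective(z):      return 1.0 / (1.0 + np.exp(-z[..., 1]))
def p_drug_like(z):      return 1.0 / (1.0 + np.exp(-z[..., 2]))

def class_sample(n_wanted, latent_dim=8, threshold=0.5, max_tries=100_000):
    """Rejection sampling from the latent prior: keep a sample z only if
    every attribute classifier is simultaneously confident about it."""
    accepted = []
    for _ in range(max_tries):
        z = rng.standard_normal(latent_dim)
        if min(p_high_affinity(z), p_selective(z), p_drug_like(z)) > threshold:
            accepted.append(z)
            if len(accepted) == n_wanted:
                break
    return np.stack(accepted)
```

Accepted latent vectors would then be passed through the VAE decoder to obtain SMILES strings with the desired attribute combination.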
We show five randomly chosen samples from the top 1,000 generated molecules for each of the three SARS-CoV-2 targets in Figure 4.
3.5 In Silico Screening of Generated Molecules for Toxicity
Conventionally, molecular toxicity or side effect testing is carried out via different endpoint experiments such as in vitro molecular assays, in vivo animal testing, and clinical trials. However, these experiments are costly and time-consuming. Thus, as an in silico early screening tool, we used a multitask deep neural network (MT-DNN) to assess the toxicity of the molecules.
The MT-DNN was used to predict toxicity on 12 in vitro endpoints, as in the Tox21 challenge (https://tripod.nih.gov/tox21/challenge/). We also predicted whether the generated molecules would fail clinical trials, using the ClinTox data [Wu2018]. This information allows us to prioritize testing of molecules that are less likely to be harmful, and can speed up the search for a COVID-19 therapeutic. We will continue to expand this model to include more endpoints.
Since a toxic substance is likely to exhibit toxicity across multiple endpoints, a multitask model can improve the prediction by exploiting the correlation between different endpoints. A multitask model is also more advantageous when the underlying training data is scarce (particularly in a clinical setting).
The MT-DNN model contains a total of four hidden layers: two are shared across all endpoints and two are private to each endpoint. We used a dropout [Srivastava2014] probability of 0.5 and ReLU activations for all layers except the last, which uses a sigmoid activation. Morgan fingerprints [Rogers2010] were used as the input features to the model.
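The shared/private layout can be sketched as a forward pass (NumPy, randomly initialized weights, illustrative layer sizes; the input dimension matches a common 2048-bit Morgan fingerprint, and dropout is omitted as this is an inference-time sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_SHARED, D_PRIV, N_TASKS = 2048, 256, 64, 13  # hidden sizes illustrative

relu = lambda a: np.maximum(a, 0.0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
init = lambda m, n: rng.standard_normal((m, n)) * 0.01

# Two shared layers feed per-endpoint private towers of two layers each.
Ws1, Ws2 = init(D_IN, D_SHARED), init(D_SHARED, D_SHARED)
private = [(init(D_SHARED, D_PRIV), init(D_PRIV, 1)) for _ in range(N_TASKS)]

def mt_dnn_forward(fingerprints):
    """Shared trunk, then one sigmoid head per toxicity endpoint.
    Returns per-endpoint toxicity probabilities, shape (batch, N_TASKS)."""
    h = relu(relu(fingerprints @ Ws1) @ Ws2)
    return np.concatenate(
        [sigmoid(relu(h @ Wp1) @ Wp2) for Wp1, Wp2 in private], axis=-1
    )
```

The shared trunk is what lets endpoints with scarce labels (such as the clinical data) borrow statistical strength from the correlated in vitro endpoints.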
The ROC AUC, accuracy (ACC), balanced accuracy (BAC), true negatives (TN), true positives (TP), precision (PR), recall (RC), and F1 score of the MT-DNN model on the Tox21 and ClinTox test data are reported in Table 5 in the Appendix. Although the AUC values are slightly worse than in existing work [Liu2019, see Table S14], the precision (and thus the true positive rate) achieved by the MT-DNN is much higher (the average precision from [Liu2019] over all 13 tasks is 0.45, obtained by running their code available on Github).
For comparison, we also report results from a random forest (RF) model in Table 6; the MT-DNN significantly outperforms the RF model in terms of true positive rate, recall, and F1 score. The MT-DNN model was therefore used to assess the generated molecules for toxicity. A molecule was considered toxic if it was predicted to be toxic on two or more endpoints.
4.1 Docking of Generated Molecules with Target Structure
To investigate the possible binding modes of the generated molecules with the target proteins, we docked the generated molecules against the potential binding sites (Figure 7), predicted using PrankWeb [krivak2018p2rank, jendele2019prankweb], within the target structures using AutoDock Vina [trott2010autodock]. Table 3 summarizes preliminary results from these docking runs. In the lowest-energy docking mode, 88%, 94%, and 96% of generated molecules showed a binding free energy below -6 kcal/mol for at least one of the binding pockets in the NSP9 dimer, Main Protease, and RBD, respectively. We also report the average and minimum binding free energy, as well as the fraction of generated molecules with a binding free energy below -6 kcal/mol, for each pocket (Table 3). Figure 8 shows the best generated molecule (with the lowest binding free energy) docked to a binding pocket in the target structure. These initial results suggest the potential of these molecules as inhibitors of the COVID-19 targets, which will be evaluated further.
| Target | Pocket | Mean (kcal/mol) | Min (kcal/mol) | Low Energy (%) |
|---|---|---|---|---|
| NSP9 Dimer (88%) | pocket 1 | | | |
| Main Protease (94%) | pocket 1 | | | |
| RBD (96%) | pocket 1 | | | |
4.2 Novelty of Generated Molecules
To assess the novelty of generated molecules, we assigned each molecule $m$ a score representing its minimal distance (maximal similarity) to all registered compounds in a reference database $D$ of known compounds:
$$\mathrm{novelty}(m) = 1 - \max_{c \in D} \mathrm{sim}(m, c).$$
To determine the similarity, we derive structural fingerprints, MACCS keys [maccs_durant], for each molecule $m$ and compound $c$. The Tanimoto coefficient [tanimoto_willett] between the two fingerprints $f_m$ and $f_c$ expresses their similarity:
$$\mathrm{sim}(m, c) = T(f_m, f_c) = \frac{|f_m \cap f_c|}{|f_m \cup f_c|}.$$
Note that a novelty of 0 means that the molecule’s fingerprint matches exactly the fingerprint of a compound in the reference database.
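These two definitions can be sketched directly, representing each fingerprint as the set of its on-bits (in practice the MACCS keys would be computed with a cheminformatics toolkit such as RDKit):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two bit-vector fingerprints,
    each given as the set of its on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def novelty(mol_fp, reference_fps):
    """1 minus the maximum Tanimoto similarity to any reference compound;
    0 means an exact fingerprint match exists in the reference database."""
    return 1.0 - max(tanimoto(mol_fp, ref) for ref in reference_fps)
```

Scanning a database the size of Pubchem this way is a brute-force pass over all reference fingerprints; production similarity search would use a dedicated index, but the score is the same.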
The novelty distributions of the generated molecules with respect to both the Pubchem database [pubchem_kim] and our training set are given in Figure 3. The molecular fingerprints are further used to determine nearest neighbors among the generated molecules. Figure 5 shows an example of the most similar generated molecules (connected blue dots) to a selected generated molecule (in purple), which allows manual evaluation and filtering based on molecular diversity.
When compared with the training database of 1.9M molecules, we find that the likelihood of generating molecules with a novelty value of 0 is 2%. With respect to the larger Pubchem database of 103M molecules, the majority of which were not included in model training, the percentage of generated molecules with a novelty value of 0 is 9.5%, 3.7%, and 8.3% for Main Protease, RBD, and NSP9, respectively. Only 19, 5, and 15 of them match an existing SMILES string in Pubchem exactly.
| Target | Pred. Affinity (pIC50) | Docking Energy (kcal/mol, pocket) | CID | Biological Activity |
|---|---|---|---|---|
| NSP9 Dimer | | -7.60 (2) | | Antagonist of rat mGluR |
| NSP9 Dimer | | -6.30 (1) | | Active against human S6 kinase |
| NSP9 Dimer | | -6.30 (3) | | Matrix metalloproteinase inhibitor |
| Main Protease | | -6.30 (1) | | Dihydrofolate reductase inhibitor |
| Main Protease | | -7.00 (1) | | Shiga toxin inhibitor |
| RBD | 7.82 | -6.80 (1) | 76332092 | Plasmepsin inhibitor |
As shown in Table 4, some of these SMILES strings are reported with biological activity in Pubchem, which calls for further investigation. For example, the molecule with Pubchem Compound ID (CID) 76332092 (labeled in the data download) is a known Plasmepsin-2 and Plasmepsin-4 inhibitor. The plasmepsin family of proteins is a potential target for antimalarial drugs due to its haemoglobin-degrading activity. CID 76332092 has also shown antimalarial activity against chloroquine-sensitive Plasmodium falciparum 3D7 infected in RBV after 72 hrs. Given that binding of the RBD of the S protein to the angiotensin-converting enzyme-2 (ACE-2) receptor is needed for viral entry into host cells [hoffmann2020sars], both the RBD and the ACE-2 receptor are being actively investigated as COVID-19 targets. Chloroquine (and its hydroxy derivative), a known ACE-2 inhibitor, has already been considered a promising COVID-19 drug [liu2020hydroxychloroquine]. Given that chloroquine is also an antimalarial drug, and that CID 76332092 shows a predicted pIC50 of 7.82 and a lowest docking binding free energy of -6.80 kcal/mol (Figure 9) to the ACE-2 binding pocket of the RBD (Table 4), this molecule deserves further investigation in the context of SARS-CoV-2. It is also noteworthy that the generated molecule with the highest predicted pIC50 for the RBD (with a docking binding free energy of -6.9 kcal/mol in the top binding mode with pocket 1) shares strong maximum common subgraph similarity [cao2008maximum] with Telavancin, an approved pneumonia drug, as shown in Figure 6. These results indicate that CogMol can generate promising and biologically relevant drug candidates beyond the training dataset.
5 Conclusions and Future Work
In this paper, we proposed CogMol, a framework for Controlled Generation of Molecules with a set of desired attributes. Our framework can handle candidate generation for multiple targets with the same trained model and allows for explicit accounting of off-target selectivity. We also developed a method for screening generated candidates using a model that predicts in vitro and clinical toxicity. Docking to the target proteins shows the potential of the generated molecules as ligands. The novelty of the generated molecules was further investigated with respect to molecules in Pubchem. We make 1,000 novel drug candidates available for each of the three SARS-CoV-2 targets in a user interface (https://covid19-mol.mybluemix.net/). Early structural analyses of the generated molecules reveal possible biological associations that may be relevant to COVID-19 therapeutic design. Validation of the most promising generated molecules is underway, and the paper will be updated with further results.
Appendix A Performance of MT-DNN Toxicity Predictor
Tables 5 and 6 show the performance of toxicity prediction using the MT-DNN and the random forest as the baseline. The MT-DNN significantly outperforms the RF model in terms of true positive rate, recall, and F1 score, while incurring a small penalty in ROC AUC and precision.