Over the last 20 years, there is a dramatic rise in the use and misuse of opioids in the United States, including misuse of prescription opioids, resurgence in heroin use and increase in abuse of illicit synthetic opioids such as fentanyl, which led to the current opioid epidemic and caused a rising number of overdose deaths29. According to the Centers for Disease Control and Prevention, the rate of opioid overdose deaths keeps rising from 1999 to 2018, posing a major threat to public health12. Opioid overdose happens when an excessive amount of opioid agonists work on the -opioid receptor (MOR) in the brain, resulting in respiratory depression and eventually death27. To reverse opioid overdoses, naloxone as shown Figure 1, an antagonist to the MOR, is used as the most common antidote, usually in the nasal formulation so as to efficiently bypass the blood brain barrier (BBB) and exert an immediate effect29.
However, naloxone can be distributed away from the brain rapidly, leading to a brief period of pharmacodynamic action, which is possibly caused by its limited BBB permeability4. Considering that opioid agonists have a longer half-life, there is a risk of inadequate response or re-narcotization after a single dose of naloxone, especially in patients who have taken large doses or long-acting opioid formulations26. Administration of repeated doses of naloxone may be necessary if respiratory depression recurs32. At the same time, the price of naloxone nasal spray is high 11. Other attempts to lengthen the time for reversing opioid overdose, such as combining naloxone with other opioid antagonists, have failed 17. For all these reasons, there is a demand for more effective opioid antagonists with enhanced brain retention ability, which corresponds to high BBB permeability.
Nevertheless, developing new drugs costs 2.6 billion dollars on average, and can take more than 10 years 1. Drug discovery for lead compounds, i.e., promising drug candidates, requires iterative organic synthesis and screening assays, with a failure rate higher than 90% 14
. Recently, the increase in the amount of chemical and biomedical data has encouraged the use of ‘data-hungry’ machine learning algorithms such as deep learning to generate and optimize molecules, which significantly accelerates the drug discovery process by reducing resources spent on wet-lab synthesis and characterization of bad lead compounds2; 7. By representing molecules as simplified molecular-input line-entry system (SMILES) strings, the generation of potential drug molecules can be treated as a sequence generation problem. With reinforcement learning (RL)31, the generation can be biased towards molecules with certain desired properties. For example, Popova et al used RL to train a molecule generator to generate novel compound libraries with a desired physicochemical or biological property 25.
For naloxone, it targets the central nervous system (CNS). Several physicochemical factors underlie permeation through the BBB for CNS drugs22. For instance, CNS active drugs tend to have smaller molecular weight (MW). Molecules with MW less than 500 can undergo significant free diffusion and when MW increases from 200 to 450, BBB permeability decreases 100-fold. Besides, CNS drugs must have sufficient lipophilicity (measured by the partition coefficient between octanol and water, logP) to cross the hydrophobic phospholipid bilayer of cell membranes. One example of increased BBB permeability with higher logP is that heroin (logP=2.3) exhibits much higher brain uptake than morphine (logP=0.99). Besides, solubility (measured by logS) is also an important property because successful nasal products like naloxone nasal spray usually require the active ingredient to be highly soluble32.
The driving question, in this study, is whether there can be molecules with both sufficient opioid antagonistic effect (i.e., a higher negative logarithm of the experimental half maximal inhibitory concentration, pIC50) and enhanced brain retention ability (i.e., a smaller MW and a higher logP) while maintaining high solubility (i.e., a higher logS). Given that the number of drug-like molecules is estimated to be between
Given that the number of drug-like molecules is estimated to be betweenand , routine virtual screening on existing compound libraries can not guarantee finding molecules with multiple desired properties and exhaustive searching in the huge chemical space can be prohibitively expensive25. Therefore, a multi-objective deep reinforcement learning (DRL) framework is used for the discovery of better opioid antagonists.
2.1 A Deep Reinforcement Learning Framework
Our framework consists of three major components: 1) a generative model based on an RNN model that can generate SMILES strings; 2) a predictive model that predicts the properties of interest for a given SMILES string; and 3) an RL engine which biases the generative model towards generating SMILE strings with desired properties, the values of which are predicted by the predictive model.
Inspired by Popova et al25, we first train the generative model on a large corpus (1.9M) of real-world compound SMILES strings to learn the syntax of SMILES so that the generative model is able to generate valid SMILES strings. The learned weights provide a good initialization for the generative model during the RL stage. Second, we train the predictive model which contains a predictive sub-model for every property of interest. In this paper, we built three sub-models for pIC50, logP and logS respectively. There is no sub-model for MW since it can be directly calculated. With the learned predictive model and well-initialized generative model, we use an RL algorithm, REINFORCE 33, to further train the generative model in an end-to-end fashion such that the generated SMILES strings can have the desired properties. Figure 2 depicts an overview of the DRL framework.
Reinforcement Learning. Reinforcement learning refers to the problem of learning an optimal decision-making policy to acquire the maximal amount of rewards in a sequential decision scenario 31. Given the previously generated SMILES characters where is the start token, the stochastic policy (i.e., the generative model) samples the next SMILES character as its action. Then, a reward is provided by the reward function and the state will be updated to . This generation process repeats until the generative model samples a termination token or reaches a predefined maximum length of a SMILES string. RL seeks an optimal policy to maximize the expected cumulative future rewards (i.e., return)
where is the set of all candidates policies , and is a sampled roll-out (i.e., a SMILES string) of length from . is the discount factor and usually . Solving equation (1) is a difficult problem, and many algorithms have been proposed in past decades 33; 28; 31. Here, we use a classical RL algorithm called REINFORCE.
REINFORCE. REINFORCE belongs to a family of RL algorithms called policy gradient methods 31 which estimates the gradient of certain performance measures of a decision-making policy and inputs the gradient into a stochastic gradient ascent algorithm to improve the policy toward higher total rewards. Formally, if , REINFORCE seeks to maximize
where captures the expected return under the distribution of all possible state and action sequences. Note that in REINFORCE, the policy is probabilistic, i.e.,
is the probability of taking actionin state . Hence, the gradient of the objective function can be written as follows
However, computing is non-trivial due to the high dimensionality of the space of possible state and action sequences. REINFORCE addresses this problem by using Monte Carlo sampling and approximating the gradient by
At each iteration, REINFORCE samples roll-outs from the current policy (i.e., the generative model), which are used to estimate using Equation (2). Then, parameters of the policy can be updated as
where is the learning rate.
Due to the high variance in the sampling process, training can be unstable. To address this, a baseline rewardis often estimated and subtracted from . Hence, the gradient becomes
Thus, REINFORCE can learn the parameters of the generative model in an end-to-end fashion by using backpropagation. During training, actions leading to higher total rewardswill be reinforced through increasing ; while actions resulting in lower total reward will be suppressed by decreasing .
The Multi-Objective Reward Function. RL algorithms require a properly defined reward function. In this paper, we aim to learn a generative model that is able to generate SMILES strings with multiple desired properties: 1) smaller MW; 2) higher logP; 3) higher logS and 4) higher pIC50. Note that we only build predictors for logP, logS and pIC50, while MW is computed directly from the SMILES string. Therefore, we introduce a multi-objective reward function which is a weighted sum of these properties
where , is the weight assigned to each property and is the predicted property value for the generated SMILES string . Besides, we assign a negative weight to MW to convert the minimization task to maximization. In addition, to ensure the validity of most generated SMILES strings, we regularize the generative model by penalizing the model when it generates an invalid SMILES string, with a negative reward . Hence, we define the reward function as follows:
The reward is only provided at the th (last) step of the generation. REINFORCE uses this reward function to learn a customized generative model that is capable of generating valid SMILES strings with multiple desired properties.
2.2 Data Collection
In order to train the generative model, we set up a SMILES-strings corpus with 1,870,310 unique compounds, which is retrieved from ChEMBL25 database10. Note that all SMILES strings here are canonicalized, which means that they are uniquely mapped to compounds. To train the predictive model, we acquire logS and logP data from the literature30; 25 and removed the duplicates. IC50 data against MOR (ChEMBL ID: 4354) are retrieved from ChEMBL25 database10. We only include compounds with explicit IC50 values at the same scale. If the IC50 value is low, then the corresponding compound is highly potent against its target since only a very little amount of the compound can inhibit the target. We take their negative logarithm to get the pIC50 dataset. A high pIC50 means that the compound has high potency. Basic statistics for the three datasets are summarized in Table 1.
2.3 Model Architecture
The Generative Model.
To accelerate the training of the generative model in the RL stage, we first train the generative model to learn the syntactical rules for constructing SMILES strings. At each time step, the generative model takes a current prefix string of a training instance (i.e., a SMILES string), and predicts the probability distribution of the next character (Figure3(a)). A cross-entropy loss is calculated at each step and parameters of the model are updated through back propagation. By treating each step as a multi-class classification problem, we fit the generative model to existing SMILES strings such that the model can generate valid SMILES strings. Importantly, the learned weights later serve as a good initialization for the generative model and expedite the training in the RL stage.
The generative model is a recurrent neural network19 which is capable of generating SMILES strings. Essentially, a SMILES string is composed of a sequence of characters from a vocabulary and for all . To facilitate the generation process, we append a start token and a termination token to the head and tail of , respectively. As shown in Figure 3(b), at each time step , the input to the generative model is a character ( is the start token). The generative model first uses an embedding layer to convert the categorical character
into an embedding vector of continuous scalars, which is then processed by a recurrent layer to update its hidden state. Finally, a dense layer and a softmax layer are used to map the hidden state to a probability distribution of the next possible character, from which we sample the next character. By repeating this process until a termination token is sampled, the generative model generates a complete SMILES string.
The Predictive Model. The predictive model consists of multiple property predictors, each for one property of interest. Here, we consider three properties, namely, logP, logS and pIC50. Property prediction is essentially a regression task where we aim to map a SMILES string to a scalar value. All property predictors are RNN-based models with the same architecture. Figure 4 illustrates the predictive model architecture. First, a SMILES string is passes through an embedding layer, converting each character in the SMILES string into an embedding vector. Second, the recurrent layer sequentially processes the embedding vectors and constructs a temporal feature vector for the input SMILES string. Last, three consecutive dense layers are used to map the feature vector to a property value. The network is trained with a mean squared error (MSE) loss.
However, a large number of training examples are often required for deep learning models, like RNN, before they can achieve superior performance. For cases where only limited training examples are available, Support Vector Machines5
and Random Forests20 are often more competitive. Hence, for each property, we compare three different models: Support Vector Machines 5, Random Forests 20
and the proposed RNN-based model, and select the classifier with the smallest MSE. We find Random Forest works best for pIC50 prediction; while the RNN-based model works best for logS and logP prediction. This can be due to the fact that we only have a small number of data points (1k) for pIC50. In contrast, 14k and 10k training examples are available for logP and logS, respectively.
2.4 Implementation Details.
In Equation (6), we set the penalty of invalid SMILES to , where and
are the average and standard deviation of the weighted sum of rewards (i.e.,in Equation (5)) of the SMILES strings sampled from the initialized generative model (prior to the RL training stage). Importantly, we empirically find that the RL algorithm is sensitive to the and our setting of strikes a good balance between validity and desired properties. We assign an equal weight of to logP, logS and pIC50, and to MW. During the RL training stage, the learning rate and maximum length of SMILES strings are set to and , respectively. The generative model is trained for 240 episodes. In each episode, we sample 200 SMILES strings with a batch size of 10.
The size of the dictionary is 58 with 56 distinct SMILES tokens plus a start token and a termination token. Each token is embedded into a 512-dimensional vector. Traditional RNN models like GRU 3 and LSTM 13 are inferior in memorizing and counting thus often fail to capture algorithmic patterns in sequences 16. Meanwhile, the construction of SMILES strings needs to follow certain algorithmic rules such as atom valence constraints and bracket opening-closure. Thus, here we use Stack-augmented GRU (StackGRU) 16
as the recurrent layer in the generative model. The stack width, stack depth and hidden size of StackRNN are 256, 200, 512, respectively. For the predictive model, an ordinary single-layer GRU with hidden size 512 is used as the recurrent layer. The following three dense layers have dimension sizes 128, 32, and 1 respectively, the first two of which are followed by a ReLU layer and a batch-norm layer15
. The deep learning models are implemented with PyTorch23. RDKit 18 is used for validating and visualizing the molecules. The Random Forest model for predicting pIC50 has trees and is implemented with Scikit-Learn 24.
3.1 Evaluation of the DRL Framework
GRU vs StackGRU. We first compare the GRU and StackGRU in learning the syntax of SMILES strings. Specifically, we train two generative models (see Figure 3(a): one with GRU as the recurrent layer; the other with StackGRU) on the training corpus, and sample 10k SMILES strings from each model. By using syntactical and chemical validity check functions from RDKit, we calculate the percentage of syntactically and chemically valid compounds. We also measure the novelty of generated compounds by calculating the percentage of non-overlapping compounds between the sample and the training corpus. Furthermore, by examining the percentage of non-duplicates within the sample, we quantify the uniqueness of the generated sample.
|Configuration||Syntactical Validity (%)||Chemical Validity (%)||Novelty (%)||Uniqueness (%)|
As can be seen from Table 2, both syntactical validity and chemical validity are relatively low when using a standard GRU compared to the StackGRU. With StackGRU, syntactical validity increased to 87.29% and chemical validity increased to 77.40%, which demonstrates that StackGRU is better at learning the SMILES syntax. Both novelty and uniqueness are close to 100%, which indicates that the generative model is able to generate novel and unique SMILES strings, and does not just memorize training examples.
Property Prediction. For the predictive model, we plot the predicted value vs true value in Figure 5.
Indicated by a high correlation coefficient, i.e., and low rooted mean squared error (RMSE), the predictors for logP and logS have high precision and accuracy. However, the sub-predictive model for pIC50 has relatively poor performance, which can be caused by the insufficient data points in the pIC50 dataset. We also apply the predictive model to Naloxone. The predicted values for logP and logS of Naloxone are 1.94 and -2.67, respectively, which align reasonably well with the reported properties of Naloxone (logP=1.47 and logS=-1.8) 34.
Molecule Generation. Given that our goal is to bias the generation of molecules towards higher logP, logS, pIC50 and smaller MW, we sample 10k SMILES strings after the generative model is trained using REINFORCE for 0, 80, 160 and 240 episodes, respectively.
Figure 6 shows the distribution of each property value for chemically valid strings from the samples. The pIC50 of the generated molecules are biased toward higher values, albeit not significantly. The distribution of logS is significantly shifted to the right and the distribution of MW is also significantly shifted to the left, which means molecules with higher logS and smaller MW are more likely to be generated over training episodes. However, generated molecules tend to have lower logP values, indicated by the left-shifted distribution of logP. This phenomenon is probably because logP and logS are contradictory by nature. When there are more hydrophilic groups, higher logS and lower logP are expected and vice versa when there are more hydrophobic groups. Overall, our DRL framework is able to bias the properties of generated molecules.
Table 3 summarizes the syntactical validity, chemical validity, novelty and uniqueness of the generated samples.
|Episode||Syntactical Validity (%)||Chemical Validity (%)||Novelty (%)||Uniqueness (%)|
One major issue with RL in de novo drug design in previous studies is the reduced validity25. Here, by incorporating penalty for invalid SMILES strings in the reward function, our DRL framework generates SMILES strings approaching 100% validity when the training episodes increase. Besides, the novelty of the generated samples is also high. Uniqueness is decreasing as the number of episodes goes up since as the DRL training episodes increase, the generative model tends to converge to the distribution of a smaller number of SMILES strings with the desired properties.
3.2 Identification of Potential Lead Compounds
From the RL-trained generative model at the 160th episode, we sample 10k SMILES strings and then use the following criteria to filter out molecules with: 1) logP , 2) logS , 3) pIC50 and 4) MW . The reason why we choose the 160th episode is that it enables balanced performance with regard to the validity, novelty, uniqueness and the ability to shift property distribution. The cutoff values used in logP, logS and MW are the predicted/calculated values for naloxone in our DRL framework. Note that our goal is to discover molecules with sufficient inhibitory activity against MOR and optimal logP, logS and MW values to ensure prolonged brain retention ability. We set the cutoff value of pIC50 at 6 despite the predicted pIC50 for naloxone by our predictive model is 6.93 since pIC50 corresponds to active compounds25.
We filtered out six novel SMILES strings from the 10k-size sample, as the identified potential lead compounds. Figure 7 (a) shows their structures drawn by RDKit, which are simple by direct visual checking. We also calculated their synthetic accessibility score (SAS)8. SAS can range between 1 and 10. A high SAS, usually above 6, corresponds to high molecule complexity and increased synthesis difficulty. For the six molecules, their SAS values range from 1.40 to 3.12, indicating that our DRL framework generates highly feasible molecules.
To further demonstrate the usefulness of the DRL framework, we also sample 10k SMILES strings at the 0th episode (i.e., without RL training) and filter with the same criteria. Three SMILES strings are filtered out. Figure 7 (b) shows their structures and predicted properties. Despite having the expected properties, their structures are very complex with high SAS values, which indicates that the molecules generated without RL training are much less feasible.
4 Conclusion and Discussion
Recent years have seen an increasing use of deep learning in drug discovery with the rise of the ‘big data’ era2. Linear representations of molecules, such as the SMILES strings, are broadly used in ligand-based drug discovery studies21. By linking molecules to end points like physicochemical properties, inhibitory activity, BBB permeability, etc, end-to-end drug discovery and development is now becoming a reality 6. For example, in 2019, Insilico Medicine succeeded in using deep learning to design new lead compounds for discoidin domain receptor 1 (DDR1) kinase inhibitors from scratch in just 21 days35.
In this study, we aim to discover better opioid antagonists to help to combat the opioid epidemic, where the most common antidote for opioid overdose, naloxone, has limited blood brain barrier permeability. To accelerate the discovery for better opioid antogonists candidates, we implement a multi-objective DRL framework. The framework is able to identify valid, novel and feasible molecules with sufficient opioid antagonistic activity and enhanced brain retention ability.
More importantly, the proposed multi-objective DRL framework has great potential in accelerating drug discovery, which is a multi-property optimization task per se. For instance, effective and safe drugs need to exhibit a fine-tuned combination of pharmacokinetic and pharmacodynamic properties, such as high potency, affinity and selectivity against the drug target as well as optimal absorption, distribution, metabolism, excretion and toxicity (ADMET)9. Our study shows that with well-designed reward functions, the multi-objective DRL framework can be customized to generate molecules with optimal properties from different drug development aspects.
Advancing drug discovery via artificial intelligence. Trends Pharmacol Sci. Cited by: §1.
- The rise of deep learning in drug discovery. Drug Discov Today 23 (6), pp. 1241–1250. Cited by: §1, §4.
On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §2.4.
- Naloxone in opioid poisoning: walking the tightrope. Emerg Med J 22 (9), pp. 612–616. Cited by: §1.
- Support-vector networks. Mach Learn 20 (3), pp. 273–297. Cited by: §2.3.
- Exploiting machine learning for end-to-end drug discovery and development. Nat Mater 18 (5), pp. 435. Cited by: §4.
- Deep learning for molecular design-a review of the state of the art. Mol Syst Des Eng. Cited by: §1.
- Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminformatics 1 (1), pp. 8. Cited by: §3.2.
- ADMET modeling approaches in drug discovery. Drug Discov Today. Cited by: §4.
- The ChEMBL database in 2017. Nucleic Acids Res 45 (D1), pp. D945–D954. Cited by: §2.2.
- The rising price of naloxone—risks to efforts to stem overdose deaths. N Engl J Med 375 (23), pp. 2213–2215. Cited by: §1.
- Drug Overdose Deaths in the United States, 1999-2018. . Cited by: §1.
- Long short-term memory. Neural Comput 9 (8), pp. 1735–1780. Cited by: §2.4.
- Principles of early drug discovery. Br J Pharmacol 162 (6), pp. 1239–1249. Cited by: §1.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.4.
- Inferring algorithmic patterns with stack-augmented recurrent nets. In NeurIPS, pp. 190–198. Cited by: §2.4.
- Pharmacokinetic interaction between naloxone and naltrexone following intranasal administration to healthy subjects. Drug Metab Dispos 47 (7), pp. 690–698. Cited by: §1.
RDKit: open-source cheminformatics. Cited by: §2.4.
- Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §2.3.
- Classification and regression by randomforest. R News 2 (3), pp. 18–22. Cited by: §2.3.
- Advances and perspectives in applying deep learning for drug design and discovery. Front Robot AI 6, pp. 108. Cited by: §4.
- Pathways for small molecule delivery to the central nervous system across the blood-brain barrier. Perspectives in Medicinal Chemistry 6, pp. PMC–S13384. Cited by: §1.
- PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, pp. 8024–8035. Cited by: §2.4.
- Scikit-learn: machine learning in python. J Mach Learn Res 12 (Oct), pp. 2825–2830. Cited by: §2.4.
- Deep reinforcement learning for de novo drug design. Sci Adv 4 (7), pp. eaap7885. Cited by: §1, §1, §2.1, §2.2, §3.1, §3.2.
- Naloxone dosage for opioid reversal: current evidence and clinical implications. Ther Adv Drug Saf 9 (1), pp. 63–88. Cited by: §1.
- Opioid Overdose. In StatPearls [Internet], Cited by: §1.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.1.
- The opioid epidemic: crisis and solutions. Annu Rev Pharmacol Toxicol 58, pp. 143–159. Cited by: §1.
- AqSolDB, a curated reference set of aqueous solubility and 2d descriptors for a diverse set of compounds. Sci Data 6 (1), pp. 1–8. Cited by: §2.2.
- Reinforcement learning: an introduction. MIT press. Cited by: §1, §2.1, §2.1.
- A response to the opioid overdose epidemic: naloxone nasal spray. Drug Deliv Transl Res 3 (1), pp. 63–74. Cited by: §1, §1.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8 (3-4), pp. 229–256. Cited by: §2.1, §2.1.
- DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 46 (D1), pp. D1074–D1082. Cited by: Figure 1, §3.1.
- Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nat Biotechnol 37 (9), pp. 1038–1040. Cited by: §4.