Log In Sign Up

Optimization of Molecules via Deep Reinforcement Learning

We present a framework, which we call Molecule Deep Q-Networks (MolDQN), for molecule optimization by combining domain knowledge of chemistry and state-of-the-art reinforcement learning techniques (prioritized experience replay, double Q-learning, and randomized value functions). We directly define modifications on molecules, thereby ensuring 100 Further, we operate without pre-training on any dataset to avoid possible bias from the choice of that set. As a result, our model outperforms several other state-of-the-art algorithms by having a higher success rate of acquiring molecules with better properties. Inspired by problems faced during medicinal chemistry lead optimization, we extend our model with multi-objective reinforcement learning, which maximizes drug-likeness while maintaining similarity to the original molecule. We further show the path through chemical space to achieve optimization for a molecule to understand how the model works.


page 1

page 2

page 3

page 4


Molecular Design in Synthetically Accessible Chemical Space via Deep Reinforcement Learning

The fundamental goal of generative drug design is to propose optimized m...

Towards Better Opioid Antagonists Using Deep Reinforcement Learning

Naloxone, an opioid antagonist, has been widely used to save lives from ...

Molecule optimization via multi-objective evolutionary in implicit chemical space

Machine learning methods have been used to accelerate the molecule optim...

Curiosity in exploring chemical space: Intrinsic rewards for deep molecular reinforcement learning

Computer-aided design of molecules has the potential to disrupt the fiel...

MIMOSA: Multi-constraint Molecule Sampling for Molecule Optimization

Molecule optimization is a fundamental task for accelerating drug discov...

Deep learning for molecular generation and optimization - a review of the state of the art

In the space of only a few years, deep generative modeling has revolutio...

1 Introduction

One fundamental goal in chemistry is to design new molecules with specific desired properties. This is especially important in material design or drug screening. Currently, this process is expensive in terms of time and cost: It can take years and cost millions of dollars to find a new drug.1 The goal of this study is to partially automate this process through reinforcement learning.

To appreciate our approach, it is necessary to review briefly the previous works that employed machine learning in molecule design. One prevalent strategy is to build a generative model, which maps a point in a high-dimensional latent space to a molecule, and perform search or optimization in the latent space to find new molecules.

Gómez-Bombarelli et al. 2, Blaschke et al. 3, Segler et al. 4, Lim et al. 5, and Putin et al. 6 utilized strings as molecule representations to build a generator of SMILES7 strings, which is a linear string notation to describe molecular structures. One of the most challenging goals in this design is to ensure the chemical validity of the generated molecules. Kusner et al. 8 and Dai et al. 9 added grammar constraints to SMILES strings to improve the chemical validity of the generated molecules. Researchers have also built models on graph representations of molecules, which regards atoms as nodes and bonds as edges in an undirected graph. Li et al. 10 and Li et al. 11 described molecule generators that create graphs in a step-wise manner. De Cao and Kipf 12 introduced MolGAN for generating small molecular graphs. Jin et al. 13 designed a two-step generation process in which a tree is first constructed to represent the molecular scaffold and then expanded to a molecule. Although almost perfect on generating valid molecules, these generative models still need to address the problem of optimization. Most published work uses a separate Gaussian process model on the latent space for optimization. However, because the latent space is often high dimensional and the objective functions defined on the latent space is usually non-convex, molecule property optimization on the latent space can be difficult.

Another strategy is based on reinforcement learning, which is a sub-field of artificial intelligence. Reinforcement learning studies the way to make decisions to achieve the highest reward.

Olivecrona et al. 14, Guimaraes et al. 15, Putin et al. 16, and Popova et al. 17 applied reinforcement learning techniques on top of a string generator to generate the SMILES strings of molecules. They successfully generated molecules with given desirable properties, but struggled with chemical validity. Recently, You et al. 18 proposed a graph convolutional policy network (GCPN) for generating graph representations of molecules with deep reinforcement learning, achieving 100% validity. However, all these methods require pre-training on a specific dataset. While pre-training makes it easier to generate molecules similar to the given training set, the exploration ability is limited by the biases present in the training data.

Here we introduce a new design for molecule optimization by combining chemistry domain knowledge and reinforcement learning, which we call Molecule Deep

-Networks (MolDQN). We formulate the modification of a molecule as a Markov decision process (MDP).

19 By only allowing chemically valid actions, we ensure that all the molecules generated are valid. We then employ the deep reinforcement learning technique of Deep -Networks (DQN) 20 to solve this MDP, using the desired properties as rewards. Instead of pre-training on a dataset, our model learns from scratch. As presented below, MolDQN outperforms other state-of-the-art models on both property targeting and optimization tasks. Additionally, with the introduction of multi-objective deep reinforcement learning, our model is capable of performing multi-objective optimization in a sample efficient way.

Our contribution differs from previous work in three critical aspects:

  1. All the works presented above use policy gradient methods, while ours is based on value function learning. Although policy gradient methods are applicable to a wider range of problems, they suffer from high variance when estimating the gradient.

    21 In comparison, in applications where value function learning works, it is usually more stable and sample efficient.20

  2. Most, if not all, of the current algorithms rely on pre-training on some datasets. Although expert pre-training may lead to lower variance, this approach limits the search space and may miss the molecules which are not in the dataset. In contrast, our method starts from scratch and learns from its own experience, which can lead to better performance, i.e., discovering molecules with better properties.

  3. Our model is designed for multi-objective reinforcement learning, allowing users to decide the relative importance of each objective. Additionally, our design allows changing the objective weights in order to address another need after training. See 3.3 for more detail.

2 Methods

2.1 Molecule Modification as a Markov Decision Process

Intuitively, the modification or optimization of a molecule can be done in a step-wise fashion, where each step belongs to one of the following three categories: (1) atom addition, (2) bond addition, and (3) bond removal. The molecule generated is only dependent on the molecule being changed and the modification made. Therefore, the process of molecule optimization can be formulated as a Markov decision process (MDP). We have several key differences from previous work that employed MDP for molecule modification. 18

  • We add an explicit limit on the number of steps. This allows us to easily control how far away from a starting molecule we can go. In vast chemical space, this is a very natural way to control the diversity of molecules produced. Consequently, we do not need a discount factor for future rewards in the usual way (see below).

  • We do not allow chemically invalid actions (violations of valence constraints). These actions are removed from the action space entirely and are not even considered by our model.

  • We allow atoms/bonds to be removed as well as added.

Formally, we have MDP, where we define each term in what follows:

  • denotes the state space, in which each state is a tuple of . Here is a valid molecule and is the number of steps taken. For the initial state, the molecule can be a specific molecule or nothing, and . We limit the maximum number of steps that can be taken in this MDP. In other words, the set of terminal states is defined as , which consists of the states whose step number reaches its maximum value.

  • denotes the action space, in which each action is a valid modification to a specific molecule . Each modification belongs to one of the following three categories mentioned before:

    1. Atom addition. Firstly, we define the set of be the set of elements a molecule contains. We then define a valid action as adding (1) an atom in and (2) a bond (with all valence-allowed bond orders) between the added atom and the original molecule wherever possible. For example, with the set of elements , the atom addition action set of cyclohexane contains the 4 actions shown in Figure 1a. Note that hydrogens are considered implicitly, and all atom additions are defined as replacements of implicit hydrogens.

    2. Bond addition. A bond addition action is performed between two atoms with free valence (not counting implicit hydrogens). If there is no bond between those two atoms, actions between them consist of adding a single, double, or triple bond if the valence allows this change. Additional actions increase the bond order between those two atoms by one or two. In other words, the transitions include:

      • No bond {Single, Double, Triple} Bond.

      • Single bond {Double, Triple} Bond.

      • Double bond {Triple} Bond.

      To generate molecules that are chemically more reasonable, we include several heuristics that incorporate chemistry domain knowledge. First, in order to prevent generating molecules with high strain, we do not allow bond formation between atoms that are in rings. In addition, we added an option that only allows formation of rings with a specific number of atoms. In the experiments we have done, only rings with 3 to 6 atoms are allowed in consideration of the most common ring sizes. As an example, Figure 

      1b shows the allowed bond addition actions for cyclohexane.

    3. Bond removal. We define the valid bond removal action set as the actions that decrease the bond order of an existing bond. The transitions include:

      • Triple bond {Double, Single, No} Bond.

      • Double bond {Single, No} Bond.

      • Single bond {No} Bond.

      Note that bonds are only completely removed if the resulting molecule has zero or one disconnected atom (and in the latter case, the disconnected atom is removed as well). Therefore, no molecules having disconnected parts are created in this step.

    (a) Atom addition (b) Bond addition (c) Bond removal
    Figure 1: Valid actions on the state of cyclohexane. Modifications are shown in red. Invalid bond additions which violate the heuristics explained in Section 2.1 are not shown.

    Note that the breaking of an aromatic bond cannot be done in our definition of the MDP. However, an aromatic system can still be created in a stepwise way by adding single and double bonds alternatively, as the resulting system will be perceived as aromatic by the RDKit SMILES parser. We also include “no modification” as an action, which allows the molecule to remain unchanged before reaching the step limitation .

  • denotes the state transition probability. Here we define the state transition to be deterministic. For example, if we modify a molecule by adding a single bond, the next state we reach will be the new molecule adding the bond, with a probability of 1.

  • denotes the reward function of state . In material design or lead optimization, the reward is often a property of the molecule . In our design, a reward is given not just at the terminal states, but at each step, which empirically produces better learning performance (see Figure S3). To ensure that the final state is rewarded most heavily, we discount the value of the rewards at a state with time by (where we typically used ). In future discussions of reward , this discount factor is implicitly included for simplicity.

Implementation details.

We implemented the state transition of a molecule with the available software framework of RDKit.22. The properties of molecules are calculated with tools provided by RDKit.

2.2 Reinforcement Learning

Reinforcement Learning is an area of machine learning concerning how the decision makers (or agents) ought to take a series of actions in a prescribed environment so as to maximize a notion of cumulative reward, especially when a model of the environment is not available. Here, the environment is the molecule modification MDP we defined above, and our goal is to find a policy which selects an action for each state that can maximize the future rewards.

Intuitively, we are trying to fit a function that predicts the future rewards of taking an action on state . A decision is made by choosing the action that maximizes the function, which leads to larger future rewards.

Mathematically, for a policy , we can define the value of an action on a state to be

where denotes taking an expectation with respect to , and denotes the reward at step . This action-value function calculates the future rewards of taking action on state , and subsequent actions decided by policy . We can therefore define the optimal policy .

In our case, however, we have both a deterministic MDP and an accurate model of the environment. Therefore, is just the value function for new state plus the reward for . Thus, the model we build approximates as .

Under the setting that the maximum number of steps is limited, the MDP is time-dependent, and the optimal policy will be time-dependent as well. Naturally, if there are many steps left, we can risk pursuing later but larger rewards, while if only a few steps remain, we should focus on rewards that can be obtained sooner.

We adopt a deep -learning 20 algorithm to find an estimate of the

function. We refer to a neural network function approximator as the parameterized

-value function , where

is the parameter. This approximator can be trained by minimizing the the loss function of

where is the target value.

2.3 Multi-Objective Reinforcement Learning

In real-world applications like lead optimization, it is often desired to optimize several different properties at the same time. For example, we may want to optimize the selectivity of a drug while keeping the solubility in a specific range. Formally, under the multi-objective reinforcement learning setting, the environment will return a vector of rewards at each step

, with one reward for each objective, i.e. , where is the number of objectives.

We implemented an extension23 of the deep -networks (DQN) algorithm by using a separate network for each (actually, a multitask neural network with a separate head for each objective; see Section 2.5), where . We then denote the vector of values as .

Similar to the case of single objective -learning, the optimal action can be chosen from the maximum “scalarized” -value: , where is the weight vector that denotes the priorities on the objectives.

The key advantage of this formulation of separate functions is that it allows the weights used during training to differ from the weights used during evaluation. However, this method is not fully general or guaranteed to converge to the optimal policy for all weights, especially in cases where there is direct competition among the objectives (see Mossalam et al. 24 for example). However, we have found this method to be effective with this MDP and leave a deeper consideration of the limits of this approach to future work.

2.4 Exploitation vs. Exploration During Training

The trade-off between exploitation and exploration presents a dilemma caused by the uncertainty we face. Given that we do not have a complete knowledge of the rewards for all the states, if we constantly choose the best action that is known to produce the highest reward (exploitation), we will never learn anything about the rewards of the other states. On the other hand, if we always chose an action at random (exploration), we would not receive as much reward as we could achieve by choosing the best action.

One of the simplest and the most widely used approaches to balance these competing goals is called -greedy, which selects the predicted best action with probability , and a uniformly random action with probability . Without considering the level of uncertainty of the value function estimate, -greedy often wastes exploratory effort on the states that are known to be inferior. To counter this issue, we followed the idea from Osband et al. 25 by utilizing randomized value functions to achieve deep exploration. We built independent -functions , each of them being trained on a different subset of the samples. At each episode, we uniformly choose , and use for decision making. The above approach is combined with -greedy as our policy. During training, we annealed from 1 to 0.01 in a piecewise linear way.

2.5 Deep -Learning Implementation Details

We implemented the deep -learning model described by Mnih et al. 20 with improvements of prioritized experience replay 26 and double -learning27. Recall that a state is a pair of molecule and time . Unsurprisingly, including in the model performs better experimentally (see Figure S4).

We used a deep neural network to approximate the -function. The input molecule is converted to a vector form called its Morgan fingerprint28

with radius of 3 and length of 2048, and the number of steps remaining in the episode was concatenated to the vector. A four-layer fully-connected network with hidden sizes of [1024, 512, 128, 32] and ReLU activations is used as the network architecture. Its output dimension is the number

(see above; for computational efficiency, we implement these different models as multiple outputs on top of shared network layers). In most experiments, we limit the maximum number of steps per episode to 40, given that most drug molecules have less than 40 atoms (the exception is for the experiments in Section 3.2, where we limit episodes to 20 steps). We train the model for 5,000 episodes with the Adam optimizer29 with a learning rate of 0.0001. We use -greedy together with randomized value functions as a exploration policy, and, as mentioned before, anneal from 1 to 0.01 in a piecewise linear way.

3 Results and Discussion

In these tasks, we demonstrated the effectiveness of our framework on optimizing a molecule to achieve desired properties. We compared MolDQN with the following state-of-the-art baselines:

  • Junction Tree Variational Autoencoder (JT-VAE)

    13 is a deep generative model that maps molecules to a high-dimensional latent space and performs sampling or optimization in the latent space to generate molecules.

  • Objective-Reinforced Generative Adversarial Networks (ORGAN) 15 is a reinforcement learning based molecule generation algorithm that uses SMILES strings for input and output.

  • Graph Convolutional Policy Network (GCPN) 18 is another reinforcement learning based algorithm that operates on a graph representation of molecules in combination with a MDP.

3.1 Single Property Optimization

In this task, our goal is to find a molecule that can maximize one selected property. Similar to the setup in previous approaches18, 13, we demonstrated the property optimization task on two targets: penalized logP and and Quantitative Estimate of Druglikeness (QED)30. LogP is the logarithm of the partition ratio of the solute between octanol and water. Penalized logP13 is the logP minus the synthetic accessibility (SA) score and the number of long cycles.

In this experiment setup, the reward was set to be the penalized logP or QED score of the molecule. For logP optimization, the initial molecule was set to be empty, while for QED optimization, a two-step optimization was used to improve the result. The first step started with an empty molecule, and the second step started with the 5 molecules that have the highest QED values found in step one. We allowed generation of all ring sizes in this task. We run the final models to generate 100 molecules and report the top three property scores found by each model and the percentage of molecules in Table 1. MolDQN outperform others on both tasks. Noticeably, the range of penalized logP is , while the range of QED is . Given that molecules with QED values higher than 0.9 are rare, it is not a trivial task to improve the QED by even 0.01. We also visualized the best molecules we found in Figure 2. Note that in the optimization of penalized logP, the generated molecules are obviously not drug-like, which highlights the importance of carefully designing the reward (including using multiple objectives in a medicinal chemistry setting) when using reinforcement learning.

The better performance of MolDQN compared with the other three baselines can be partly attributed to learning from scratch, where the scope is not limited to the molecules in a specific dataset.

(a) Optimization of penalized logP (b) Optimization of QED
Figure 2: Sample molecules in the property optimization task. (a) Optimization of penalized logP; note that the generated molecules are obviously not drug-like due to the use of a single-objective reward. (b) Optimization of QED.
Penalized logP QED
1st 2nd 3rd Validity 1st 2nd 3rd Validity
random walka -3.99 -4.31 -4.37 100% 0.64 0.56 0.56 100%
JT-VAEb 5.30 4.93 4.49 100% 0.925 0.911 0.910 100%
ORGANb 3.63 3.49 3.44 0.4% 0.896 0.824 0.820 2.2%
GCPNb 7.98 7.85 7.80 100% 0.948 0.947 0.946 100%
MolDQN 11.71 11.63 11.63 100% 0.948 0.948 0.948 100%

a “random walk” is a baseline that chooses a random action for each step.

b values are reported in You et al. 18.

Table 1: Best molecule property scores found by each method.

3.2 Constrained Optimization

We performed molecule optimization under a specific constraint, where the goal is to find a molecule that has the largest improvement compared to the original molecule , while maintaining similarity for a threshold . Here we defined the similarity as the Tanimoto similarity111The Tanimoto similarity uses the ratio of the intersecting set to the union set as the measure of similarity. Represented as a mathematical equation . and represents the number of attributes in each object . is the number of attributes in common. between Morgan fingerprints28 with radius 2 of the generated molecule and the original molecule . Following the experiment in Jin et al. 13, we trained a model in an environment whose initial state was randomly set to be one of the 800 molecules in ZINC dataset which have the lowest

penalized logP value, and evaluated the trained model for each molecule for one epoch. The maximum number of steps per episode was limited to 20 in consideration of computational efficiency. In this task, the reward was designed as follows:

where is the coefficient to balance the similarity and logP. If the similarity constraint is not satisfied, the reward is penalized by the difference between the target and current similarity. In our experiments . We report the success rate (the percentage of molecules satisfying the similarity constraint), and average improvement on logP in Table 2.

Improvement Success Improvement Success Improvement Success
0.0 97.5% 100% 100%
0.2 97.1% 100% 100%
0.4 83.6% 100% 100%
0.6 46.4% 100% 100%

a values are reported in Jin et al. 13.

Table 2:

Mean and standard deviation of property improvement in constraint optimization tasks.

is the threshold of the similarity constraint .

3.3 Multi-Objective Optimization

In drug design, there is often a minimal structural basis that a molecule must retain to bind a specific target, referred to as the molecular scaffold. This scaffold is usually defined as a molecule with removal of all side chain atoms. 31 Often the question arises: can we find a molecule similar to a existing one but having a better performance? We designed the experiment of maximizing the QED of a molecule while keeping it similar to a starting molecule. The multi-objective reward of a molecule was set to be a 2-dimensional vector of , where is the QED score and is the Tanimoto similarity between the Morgan fingerprints of molecule and the original molecule . We also added a hard constraint to the similarity: if the optimized molecule does not contain the same scaffold as the target molecule, the similarity is set to zero.

(a) (b)

Different weights can be applied to denote the priorities of these two objectives. The variable denotes the weight of similarity score, while the QED score is balanced by . This is referred to as a “scalarized” multi-objective optimization strategy:

We trained the model with random objective weight drawn from , and evaluated the model with weight , and . was set to 0.1 during evaluation in order to obtain a distribution of molecules. Figure LABEL:fg:multi_obj_scattera shows the properties of the optimized molecules under different weights. Figure LABEL:fg:multi_obj_scattera demonstrates that we can successfully optimize the QED of a molecule while keeping the optimized molecule similar to the starting molecule. As the weight applied on similarity increases, the optimized molecules have higher similarity to the starting molecule, and larger fractions of the optimized molecules have QED values lower than those of the starting molecules. The same experiment was repeated for 20 molecules randomly selected from ChEMBL32 (Figure S1), and the empirical distribution of the improvement of QED was plotted in Figure LABEL:fg:multi_obj_scatterb. As the weight on similarity increases, the distribution of QED moves leftwards because higher priority is placed on similarity. Finally, we visually examined the optimized molecules (Figure LABEL:fg:multi_obj_scatterc. All molecules optimized possess the same scaffold as the starting molecule, indicating the model trained preserves the scaffold.)

3.4 Optimality vs. Diversity

Related work in this area reports results for two distinct tasks: optimization and generation (or, to avoid ambiguity, property-directed sampling). Optimization is the task to find the best molecule with regard to some objectives, whereas property-directed sampling is the task of generating a set of molecules with specific property values or distributions.

For the results we report in this paper, we note that there is often a trade-off between optimality and diversity. Without the introduction of randomness, execution of our learned policy will lead to exactly one molecule. Alternatively, there are three possible ways to increase the diversity of the molecules generated:

  1. Choose one function uniformly for in to make decision in each episode.

  2. Draw an action stochastically with probability proportional to the -function in each step (as in Haarnoja et al. 33).

  3. During evaluation, use non-zero in the -greedy algorithm (we took this approach in Section 3.3).

All of these strategies are sub-optimal because the policy is no longer pursuing the maximum future rewards. In the results above, we focused primarily on optimization tasks and leave the question of diversity for future work.

We also conducted experiments to illustrate that we are able to find molecules with properties in specific ranges with 100% success (Table S1). In addition, we demonstrated that we can generate molecules that satisfy multiple target values (Table S2). However, because we formulated the property targeting to be an optimization task, it is not fair for us to compare to other generative models that produce diverse distributions of molecules.

3.5 Visualization and Interpretation

Users prefer interpretable solutions when they applying methods that construct new molecules.. Here we demonstrated the decision making process of MolDQN that maximizes the QED, starting from a specific molecule.

(a) (b)

Figure 4: (a) Visualization of the -values of selected actions. The full set of -values of actions are shown in Figure S2. The original atoms and bonds are shown in black while modified ones are colored. Dashed lines denote bond removals. The -values are rescaled to (b) The steps taken to maximize the QED starting from a molecule. The modifications are highlighted in yellow. The QED values are presented under the modified molecules.

In the first step of decision making, the -network predicts the -value of each action. Figure 4a shows the predicted -values of the chosen actions. The full set of -values of for all actions in the first step are shown in Figure S2. We observe that forming the triazine ring is strongly favored, while removing the hydroxyl group is disfavored.

The -value is a measure of future rewards; therefore, it is possible for the algorithm to choose an action that decreases the property value in the short term but can reach higher future rewards. Figure 4b shows a sample trajectory of maximizing the QED of a molecule. In this trajectory, both step 6 and step 8 decrease the QED value of the molecule, but finally lead to a molecule with a QED of 0.937, which represents a 0.323 improvement relative to the starting molecule.

4 Conclusion

By combining state-of-the-art deep reinforcement learning with domain knowledge of chemistry, we developed the MolDQN model for molecule optimization. We demonstrated that MolDQN outperforms several other established algorithms in generating molecules with better specified properties. We also presented a way to visualize the decision making process to facilitate learning a strategy for optimizing molecular design. Future work can be done on applying different -function approximators (for example MPNN34

) and hyperparameter searching. We hope the MolDQN model will assist medicinal and material chemists in molecular design.

The authors thank Zan Armstrong for her expertise and help in visualization of the figures. The authors thank David Belanger and John Platt for the internal review and comments. ZZ and RNZ thank the support from the National Science Foundation under the Data-Driven Discovery Science in Chemistry (D3SC) for EArly concept Grants for Exploratory Research (EAGER) (Grant CHE-1734082).

Supporting Information will be available online.