Following advances in generative modelling for domains such as computer vision and natural language processing, there has been increased interest in applying generative methods to drug discovery. However, such approaches often fail to address numerous technical challenges inherent to molecular design, including accurate molecular reconstruction, efficient exploration of chemical space, and synthetic tractability of generated molecules. Further, these approaches bias the generation of molecules towards the data distribution over which they were trained, restricting their ability to discover truly novel compounds. Previous workyou_graph_2019; zhou_optimization_2019 has attempted to address these issues by framing molecular design as a reinforcement learning problem sutton2018reinforcement, in which an agent learns a mapping from a given molecular state to atoms that can be added to the molecule in a step-wise manner. These approaches generally ensure validity of the generated compounds and avoid the need to learn a latent space mapping from the data. However, they do not address the issue of synthetic tractability, and the proposed atom-by-atom environment transitions prevent rapid exploration of chemical space.
We instead approach the problem in a way that incorporates a favourable bias into the Markov Decision Process. Specifically, we define the environment’s state transitions as sequences of chemical reactions, allowing us to address the common issue of synthetic accessibility. While ensuring synthezability of computationally-generated ligands is challenging, our framework treats synthesizability as a feature rather than as a constraint. Our approach, deemed REACTOR (REACTion-driven Objective Reinforcement), thus addresses a common limitation of existing methods, whereby the synthetic routes for generated molecules are unknown and require challenging retro-synthetic planning. Importantly, the REACTOR framework is able to efficiently explore synthetically-accessible chemical space in a goal-directed manner, while also providing a theoretically-valid synthetic route for each generated compound.
We benchmark our approach against previous methods, focusing on the task of identifying novel ligands for the D2 dopamine receptor (DRD2), a G protein-coupled receptor involved in a wide range of neuropsychiatric and neurodegenerative disorders dopaminereview. In doing so, we find that our approach outperforms previous state-of-the-art methods, is robust to the addition of multiple optimization criteria, and produces synthetically-accessible, drug-like molecules by design.
2 Related Work
Computational drug design has traditionally relied on domain knowledge and heuristic algorithms. Recently, however, several machine learning based generative approaches have also been proposed. Many of these methods, such asguimaraes_objective-reinforced_2018
, take advantage of the SMILES representation using Recurrent Neural Networks (RNNs) but have difficulties generating syntactically valid SMILES. Graph-based approachesde_cao_molgan_2018; liu_constrained_2019; jin_junction_2019 have also been proposed and generally result in improved chemical validity. These methods learn a mapping from molecular graphs to a high-dimensional latent space from which molecules can be sampled and optimised. In contrast, pure reinforcement learning algorithms such as you_graph_2019; zhou_optimization_2019 treat molecular generation as a Markov Decision Process, in which molecules are assembled from basic building blocks such as atoms and fragments. However, a core limitation of existing methods is the forward-synthetic feasibility of proposed designs. To overcome these limitations, button_automated_2019 propose a hybrid rule-based and machine learning approach in which molecules are assembled from fragments under synthetic accessibility constraints in an iterative single-step process. However, this approach is limited in terms of the flexibility of its optimisation objectives, as it only allows for generation of molecules similar to a given template ligand.
In order to have practical value, methods for computational drug design must also make appropriate tradeoffs between molecular generation, which focuses on the construction of novel and valid molecules, and molecular optimization, which focuses on the properties of the generated compounds. While prior work has attempted to address these challenges simultaneously, this can lead to sub-optimal results by favouring either the generation or the optimization tasks. Generative models generally do not scale well to complex property optimization problems, as they attempt to bias the generation process towards a given objective within the latent space while simultaneously optimizing over the reconstruction loss. These objectives are often conflicting, making goal-directed optimization difficult and hard to scale when multiple reward signals are required. This is generally the case in drug design, where drug candidates must show activity against a given target as well as favourable selectivity, toxicity, and pharmacokinetic properties.
In contrast, atom-based reinforcement learning addresses the generative problem via combinatorial enumeration of molecular states zhou_optimization_2019 or a posteriori verification of moleculesyou_graph_2019. These solutions are often slow, and create a bottleneck in the environment’s state transitions that limits effective optimization.
In this work, we decompose generation and optimization by delegating each problem to a distinct component of our computational framework. Specifically, we allow an Environment module to handle the generative process, using known chemistry as a starting point for its design, while an Agent learns to effectively optimize compounds through interactions with this Environment. By disambiguating the responsibilities of each component, and by formalizing the problem as a Markov Decision Processes (MDPs), we allow the modules to work symbiotically, exploring chemical space both more efficiently and more effectively.
We begin with a short overview of Markov Decision Processes and Actor-Critic methods for reinforcement learning before defining our framework in detail.
3.1.1 Markov Decision Processes
A Markov Decision Process (MDP) 10.2307/24900506 is a powerful computational framework for sequential decision-making problems. An MDP is defined via the tuple , where defines the possible states, denotes the possible actions that may be taken at any given time, denotes the reward distribution of the environment, and defines the dynamics of the environment. Interactions within this framework give rise to trajectories of the form , with a terminal time step. Crucially, an MDP assumes that:
where denotes discrete time steps.
This definition states that all prior history of a decision trajectory can be encapsulated within the preceding state, allowing an agent operating within an MDP to make decisions based solely on the current state of the environment. This assumption provides the basis for efficient learning, and holds under our proposed framework. An agent’s mapping from any given state to action probabilities is termed apolicy, and the probability of an action at state is denoted .
3.1.2 Policy Optimization
The underlying objective of a Reinforcement Learning agent operating in an MDP is to optimize its policy to maximize the expected return from the environment, until termination at time , defined for any step by:
where is a discount factor determining the value of future rewards, and the expectation is taken over the experience induced by the policy’s distribution. Several approaches exist for learning a policy that maximizes this quantity. In value-based approaches, Q-values of the form
are trained to estimate the scalar value of action-value pairs as estimates of the expected return. A policy is then derived from these values through strategies such as-greedy control sutton2018reinforcement. Alternatively, policy-based approaches attempt to parameterize the agent’s behaviour directly, for example through a neural network, to produce . While our framework is agnostic to the specific algorithm used for learning, we choose to validate our approach with an actor-critic architecture Konda_Tsitsiklis_2000. This approach combines the benefits of learning a policy directly using a policy network
, with a variance-reducing value network. Specifically, we use a synchronous version of DBLP:journals/corr/MnihBMGLHSK16, which is amenable to high parallelization and further gains in training efficiency. The Advantage Actor-Critic (A2C) objective function at time is given by
Intuitively, maximization of equation 4’s first term involves adjusting the policy parameters to align high probability of an action with high expected return, while the second term serves as an entropy regularizer preventing the policy from converging too quickly to sub-optimal deterministic policies.
3.2 Molecular Design via Synthesis Trajectories
A core insight of our framework is that we can embed knowledge about the dynamics of chemical transitions
into a Reinforcement Learning system for guided exploration. In doing so, we induce a bias over the optimization task which, given its close correspondence with natural molecular transitions, should increase learning efficiency while leading to better performance across a larger, pharmacologically relevant chemical subspace.
We propose embedding this bias into the transition model of an MDP by defining possible transitions as true chemical reactions. This allows us to gain the additional benefit of built-in synthetic accessibility, in addition to immediate access to one possible synthesis route for generated compounds. This is demonstrated in Figure 1, where a sample trajectory is provided by REACTOR for a DRD2-optimized molecule, while a high-level overview of our framework is presented in Figure 2.
3.2.1 Framework Definition
We define each component of our MDP as follows:
We allow for any valid molecule to comprise a state in our MDP. Practically, the state space is defined as , with
a feature extraction function,the space of molecules reachable given a set of chemical reactions, initialization molecules, and available reactants. We use Morgan Fingerprints Rogers_Hahn_2010
with bit-length 2048 and radius 2 to extract feature vectors from molecules. These representations have been shown to provide robust and efficient featurizations, while more computationally-intensive approaches like Graph Neural Networks are yet to demonstrate significant representational benefitCorrection_FP; Kearnes_McCloskey_Berndl_Pande_Riley_2016.
In its general formulation, the action space of our framework is defined hierarchically, enabling the potential application of novel approaches for hierarchical reinforcement learning. Specifically, we define a set of higher-level actions as a curated list of chemical reaction templates, taking the form:
Each corresponds to a reactant, while each is a product of this reaction. We make use of the SMARTS syntax SMARTS to represent these objects as regular expressions. At step , the state thus corresponds to a single reactant in any given reaction. It is necessary to select which molecular blocks should fill in the remaining pieces for a given state and reaction selection. This gives rise to a set of primitive actions corresponding to a list of reactants derived from the reaction templates, which we also refer to as chemical building blocks. In contrast with previous methods you_graph_2019; zhou_optimization_2019, which establish a deterministic start state such as an empty molecule or carbon atom, we initialize our environment with a randomly-sampled building block which matches at minimum one reaction template. This ensures that a trajectory can take place and encourages the learned policies to be generalizable across different regions in chemical space.
For our experiments, we work with two-reactant reaction templates and select missing reactants based on those which will most improve the next state’s reward. We also select the chemical product in this manner when more than one product is generated. Doing so collapses our hierarchical formulation into a standard MDP formulation, with the reaction selection being the only decision point. Additionally, it is likely that for any step , the set of possible reactions is smaller than the full action space. In order to increase both the scalability of our framework (by allowing for larger reaction lists) and the speed of training, we use a mask over unfeasible reactions. This avoids the need for the agent to learn the chemistry of reaction feasibility, and reduces the effective dimension of the action space at each step. The policy then takes the form , with the environment’s masking function. We discuss potential mechanisms to make use of the hierarchical formulation of the action space in Section 5.
Appropriate reward design is crucial given that it drives the policy optimization process. In you_graph_2019, intermediate and adversarial rewards are introduced in order to enforce drug-likeness and validity of generated compounds. In zhou_optimization_2019, these requirements are ignored, and while optimization performance increases, desirable pharmaceutical properties are often lost. In the REACTOR framework, the separation between the agent and the environment allows us to maintain property-focused rewards that guide optimization while ensuring chemical constraints are met via environment design.
We use a deterministic reward function based on the property to be optimized. In 1, this corresponds to the binary prediction of compounds binding to the D2 Dopamine Receptor (DRD2). In 3, these are the penalized calculated octanol-water partition coefficient (cLogP) and quantitative estimate of drug-likeness (QED) qed. In order to avoid artificially biasing our agent towards greedy policies, we remove intermediate rewards and provide evaluative feedback only at termination of an episode. While we feel this is a more principled view on the design process, zhou_optimization_2019 have also suggested that using an intermediate reward discounted by a decreasing function of the step may improve learning efficiency. In 1, we further apply a constraint based on the atom count of a molecule to be consistent with prior work. When molecules exceed the maximum number of heavy atoms (38), the agent observes a reward of zero.
In the template-based REACTOR framework, state transitions are deterministic. We therefore have , according to our choice of reaction and the subsequent reactant-product selection. When modifying the reactant-selection policy, either via a stochastic heuristic such as an epsilon-greedy reactant selection, or learned hierarchical policies, state transitions over the higher level actions become stochastic according to the internal policy’s dynamics.
To validate our framework, we benchmark its performance on goal-directed design tasks, focusing primarily on predicted activity for the D2 Dopamine Receptor. To maintain consistency with experiments done in prior work, we also perform additional experiments on penalized cLogP and QED, with the results presented in Supplementary Material.
In order to better understand the exploration behaviour of our approach, we also investigate the nature of the trajectories generated by the REACTOR policies, showing that policies retain drug-likeness across all optimization objectives, while also exploring distinct regions of chemical space.
4.1 Experimental Setup
Baselines We compare our approach to two recent methods in deep generative molecular modelling, JT-VAE and ORGAN jin_junction_2019; guimaraes_objective-reinforced_2018. Each of these approaches was first pre-trained for up to 48h on the same compute facility, a single machine with 1 NVIDIA Tesla K80 GPU and 16 CPU cores. Property optimization was then performed using the same procedures as described in the original papers. We also compare our method with two state-of-the-art reinforcement learning approaches, Graph-Convolutional Policy Networks and MolDQN you_graph_2019; zhou_optimization_2019
. Each algorithm was run using the open-sourced code from the authors, while we enforced the same reward function implementation across methods to ensure consistency. We ran GCPN using 32 CPU cores for approximately 24 hours (against 8 hours in the original paper), and MolDQN for 20000 episodes (against 5000 episodes in the original paper). In addition, we added a steepest-ascent hill-climbing baseline using the REACTOR environment to demonstrate that for simple, mostly greedy objectives such as cLogP and QED, simple search policies may provide reasonable performance. In contrast, learned traversals of space become necessary for complex tasks such as DRD2.
Evaluation Given the inherent differences between generative and reinforcement learning models, evaluation was adapted to remain consistent within each class of algorithms. JT-VAE and ORGAN were evaluated based on decoded samples from their latent space, using the best results across training checkpoints, with statistics for JT-VAE computed over 3 random seeds. Given the prohibitive cost of training ORGAN, results are given over a single seed. Other baselines were compared based on three sets of 100 building blocks used as starting states. Statistics are reported over sets, while the statistics of the initial states are shown by BLOCKS.
We prioritize evaluation of each method based on mean reward, given that this corresponds to the underlying optimization objectives for reinforcement learning methods. This is denoted by Activity in Table 1, which corresponds to the percentage of generated molecules which are predicted active for the DRD2 receptor. In both Table 1 and Table 3, mean reward was computed based on the set of unique molecules generated by each algorithm, in order to avoid artificially favouring methods which often generate the same molecule. Diversity corresponds to the average pairwise Tanimoto distance among generated molecules, while "Scaff. Similarity" corresponds to the average pairwise similarity between the scaffolds of the compounds, as implemented by moses. Finally, we limited the number of atoms to 38 for all single-objective tasks, as done in prior work zhou_optimization_2019; you_graph_2019; jin_junction_2019, and to 50 for the multi-objective tasks.
4.2 Goal-Directed De Novo Design
|DRD2||BLOCKS||3% 0||0.94 0||N/A||100% 0|
|Hill Climbing||62% 0||0.91 0||0.15 0.01||100% 0|
|JTVAE||11% 16%||0.06 0.08||N/A||3% 2%|
|GCPN||5% 2%||0.92 0||0.13 0||100% 0|
|MolDQN||89% 2%||0.88 0||N/A||97% 1.8%|
|REACTOR||98% 1%||0.71 0.01||0.22 0.02||99.7% 0.5%|
Computation of Scaffold Similarity requires the presence of a ring system, thus the N/A.
Results on the unconstrained design task show that REACTOR achieves highest mean reward for the DRD2 objective. We also observe that REACTOR maintains high diversity and uniqueness in addition to robust performance. This a crucial characteristic, as it implies that the agent is able to optimize the space surrounding each starting molecule, without reverting to the same molecule to optimize the scalar reward signal. In Table 3, REACTOR also achieves higher reward on QED, while remaining competitive on penalized cLogP despite the simplistic nature of this objective favouring atom-by-atom transitions.
Training efficiency is an important practical consideration while deploying methods for de novo design. Generative models first require learning a mapping of molecules to the latent space before training for property optimization. During our experiments, this resulted in more than 48h of training time. Reinforcement learning methods trained faster, but generally failed to converge within 24 hours. In contrast, as shown in Figure 8 (Supplementary Material), our approach converges within approximately two hours of training.
4.3 Synthetic Tractability and Desirability of Optimized Compounds
Given the narrow perspective offered by quantitative benchmarks for molecular design models moses, it is equally important to qualitatively assess the behaviour of these models by examining generated compounds. Figure 4 provides samples generated by each RL method across all objectives. Since the computational estimation of cLogP relies on the Wildman-Crippen method clogp, which assigns a high atomic contribution to Halogens and Phosphorous, the atom-based action space of MolDQN produces samples that are heavily biased towards these atoms, resulting in molecules that are well optimized for the task but neither synthetically-accessible nor drug-like. This generation bias was not observable in previously reported benchmarks where atom types were only limited to Carbon, Oxygen, Nitrogen, Sulfur and Halogens zhou_optimization_2019. Furthermore, MolDQN samples for the DRD2 task lack a ring system, and whereas molecules from GCPN have one, none adequately optimize for the objective.
In contrast, REACTOR appears to produce more pharmacologically desirable compounds, without explicitly considering this as an optimization objective. This is supported by Figure 3, which shows that REACTOR is the only approach able to simultaneously solve the DRD2 task while maintaining favourable distributions for synthetic-accessibility sas and drug-likeness.
Further, as shown in Figure 1 and Figure 7, optimized compounds are provided along with a possible route of synthesis. While such trajectories may not be optimal, given that they are limited by the reward design, the set of reaction templates used, their specificity, as well as the availability and cost of reactants, they provide a crucial indication of synthesizability. In Gao_Coley_2020, the authors detail the lack of consideration for synthetic tractability in current molecular optimization approaches, highlighting that this is a necessary requirement for application of these methods in drug discovery. While alternate ideas aiming to embed synthesizability constraints into generative models have recently been explored chembo; moleculechef; button_automated_2019, REACTOR is the first approach which explicitly addresses synthetic feasibility by optimizing directly in the space of synthesizable compounds using reinforcement learning.
4.4 Multi-Objective Optimization
Practical methods for computational drug design must be robust to the optimization of multiple properties. Indeed, beyond the agonistic or antagonistic effects of a small molecule, properties such as the selectivity, solubility, drug-likeness and permeability of a drug candidate must be considered. To validate the REACTOR framework under this setting, we consider the task of optimizing for selective DRD2 ligands. Dopamine receptors are grouped into two classes: D1-like receptors (DRD1, DRD5) and D2-like receptors (DRD2, DRD3 and DRD4). Although these receptors are well studied, design of drugs selective across subtypes remains a considerable challenge. In particular, as DRD1 and DRD3 share 78% structural similarity in their transmembrane region dopaminereview; dopaminereceptor, it is very challenging to identify small molecules that can selectively bind to and modulate their activity. We therefore assess performance both on selectivity across classes (using DRD1 as off-target) and within classes (using DRD3 as off-target). We then analyze how our framework performs as we increase the number of design objectives. For these experiments, we focus our comparison on MolDQN, as it strongly outperforms other methods on the single-objective tasks. We increase its training to 25,000 episodes and use reward scalarization to combine multiple objectives. Formally, a vector of reward signals is aggregated via a mapping , thus collapsing the multi-objective MDP momdp into a standard MDP formulation. While the simplest and most common approach to scalarization is to use a weighted sum of the individual reward signals, we adopt a Chebyshev scalarization scheme chebyshev, whereby reward signals are aggregated via the weighted Chebyshev metric:
where is a utopian vector, assigns the relative preferences for each objective, and indexes the objectives. For our experiments, we consider binary rewards, such that the utopian point is always , rendering the dynamics of each reward signal more similar, and assign uniform preferences to the objectives. While Chebyshev scalarization was introduced for tabular Reinforcement Learning, we can interpret it in the function of approximation setting as defining an adaptive curriculum, whereby the optimization focus shifts dynamically according to the objective most distant from . In Figure 9, we see that this approach for combining reward signals is significantly more robust as the number of objectives increases.
4.4.1 DRD2 Selectivity
|D2/D1||MolDQN||0.91 0.02||0.83 0||N/A||99% 1%|
|REACTOR||0.92 0.02||0.52 0.03||0.19 0||100% 0|
|D2/D3||MolDQN||0.92 0.02||0.88 0.01||N/A||100% 0|
|REACTOR||0.97 0.02||0.6 0.02||0.23 0.02||100% 0|
Rewards in Table 2 and Figure 5 are computed as the proportion of evaluation episodes for which the algorithms optimize all desired objectives. In Table 2, we find that REACTOR maintains strong performance on the selectivity tasks, optimizing for DRD2 while avoiding off-target activity on the D1 and D3 receptors. Further, it is able to outperform MolDQN while maintaining very low scaffold similarity among generated molecules.
4.4.2 Robustness to Multiple Objectives
In addition to off-target selectivity, we assess the robustness of each method’s performance as we increase the number of pharmacologically-relevant property objectives to optimize. Specifically, we compare the following combinations of rewards:
DRD2 with range-targeted cLogP (2 objectives) according to the Ghose filter ghose
DRD2, range cLogP, and a molecular weight ranging from 180 to 600 (3 objectives)
DRD2, range cLogP, target molecular weight, and drug absorption, as indicated by a model trained on data for the Caco-2 permeability assay caco (4 objectives)
Figure 5 shows that REACTOR demonstrates greater robustness to an increasing number of design objectives. Specifically, it maintains a global success rate of approximately 88% when optimizing for the 4 objective task. Success against DRD2 activity drops more sharply for MolDQN, while the uniqueness of generated molecules drops to below 20% for 3 and 4 objectives.
4.5 Goal-Directed Exploration
In order to gain further insight into the nature of the trajectories generated by the REACTOR agent, we plotted two alternative views of optimization routes generated for the same building block across each single-property objective. In Figure 6
, we fit a Principal Components Analysis (PCA)pca on the space of building blocks to identify the location of the initial state, and subsequently transform the next states generated by the RL agent onto this space. We find that the initial molecule is clearly shifted to distinct regions in space, while the magnitude of the transitions suggest efficient traversal of the space. This provides further evidence that exploration through space is a function of reward design, and is mostly unbiased by the data distribution of initialization states. Figure 7 shows the same trajectories with their corresponding reactions and intermediate molecular states. We find that optimized molecules generally contain the starting structure. We believe this to be a desirable property given that real-life design cycles are often focused on a fixed scaffold or set of core structures. We also note that the policy learned by our REACTOR framework is able to generalize over different starting blocks, suggesting that it achieves generation of structurally diverse and novel compounds.
This work proposes a novel approach to molecular design which defines state transitions as chemical reactions within a reinforcement learning framework. We demonstrate that our framework leads to globally improved performance, as measured by reward and diversity of generated molecules, as well as greater training efficiency while producing more drug-like molecules. Analysis of REACTOR’s robustness to multiple optimization criteria, coupled with its ability to maintain predicted activity on the DRD2 receptor, suggests increased potential for successful application in drug discovery. Furthermore, molecules generated by this framework exhibit better synthetic accessibility by design, with one viable synthesis route also suggested. Although, the reactivity and the stability of the optimized molecules remain a potential issue, REACTOR’s efficiency in a multiple optimization setting suggest their explicit consideration as additional design objectives would resolve that issue.
Future work aims to build on this framework by making use of its hierarchical formulation to guide agent policies both at the higher reaction and lower reactant levels, exploring proposals from hdqn; Bacon_Harb_Precup_2016 as a starting point. We also plan to expand the effective state space of our MDP by embedding a more efficient synthesis model, leveraging insights from transformer, as the MDP transition model. Because practical de novo design requires optimization of multiple criteria simultaneously, we believe the efficiency of our design framework provides a robust foundation for such tasks, and hope to expand on approaches proposed in abels_dynamic_2019; moffaert_multi-objective_2014; yang_generalized_2019. Finally, we intend to validate the bio-activity of generated molecules experimentally to better demonstrate real-world utility.
6 Supplementary Material
Appendix A Results
|Objective||Method||Mean Reward||Max Reward||Diversity||Scaff. Similarity||Uniqueness|
|cLogP||BLOCKS||-1.80 0.08||1.80 0.32||0.94 0||N/A||100% 0|
|Hill Climbing||7.14 0.20||10.90 0.04||0.73 0.01||0.13 0.01||100% 0|
|JTVAE||-1.48 0.56||0.16 0.14||0.54 0.2||N/A||41% 34%|
|GCPN||1.03 0.28||8.51 0.35||0.90 0||0.20 0.01||100% 0|
|MolDQN||12.84 0.23||18.42 0.37||0.71 0.01||0.72 0.23||72% 3.6%|
|REACTOR||8.01 0.18||10.74 0.28||0.69 0.01||0.20 0||99.7% 0.5%|
|QED||BLOCKS||0.523 0.005||0.763 0.009||0.94 0||N/A||100% 0|
|Hill Climbing||0.811 0.007||0.943 0.004||0.879 0.003||0.20 0.023||100% 0|
|JTVAE||0.604 0.017||0.876 0.048||0.841 0.018||0.638 0.046||92.8% 5.5%|
|GCPN||0.607 0.012||0.916 0.012||0.91 0.002||0.112 0.004||100% 0|
|MolDQN||0.857 0.026||0.936 0.004||0.791 0.007||0.620 0.123||67% 5.8%|
|REACTOR||0.876 0.007||0.947 0.001||0.878 0.002||0.161 0.021||100% 0|