
Reinforced Genetic Algorithm for Structure-based Drug Design

Structure-based drug design (SBDD) aims to discover drug candidates by finding molecules (ligands) that bind tightly to disease-related proteins (targets), and it is the primary approach to computer-aided drug discovery. Recently, applying deep generative models to three-dimensional (3D) molecular design conditioned on protein pockets has attracted much attention for SBDD, but the probabilistic-modeling formulation of these methods often leads to unsatisfactory optimization performance. On the other hand, traditional combinatorial optimization methods such as genetic algorithms (GA) have demonstrated state-of-the-art performance on various molecular optimization tasks. However, they do not utilize the protein target structure to inform design steps but rely on random-walk-like exploration, which leads to unstable performance and no knowledge transfer between different tasks despite the similar binding physics. To achieve more stable and efficient SBDD, we propose Reinforced Genetic Algorithm (RGA), which uses neural models to prioritize profitable design steps and suppress random-walk behavior. The neural models take the 3D structures of the targets and ligands as inputs and are pre-trained on native complex structures to utilize the knowledge of the binding physics shared across targets, then fine-tuned during optimization. We conduct thorough empirical studies on optimizing binding affinity to various disease targets and show that RGA outperforms the baselines in terms of docking scores and is more robust to random initializations. The ablation study also indicates that training on different targets helps improve performance by leveraging the shared underlying physics of the binding processes. The code is available at https://github.com/futianfan/reinforced-genetic-algorithm.


1 Introduction

Rapid drug discovery that requires less time and cost is of significant interest in pharmaceutical science. Structure-based drug design (SBDD) [bohacek1996art], which leverages the three-dimensional (3D) structures of disease-related proteins to design drug candidates, is one primary approach to accelerating drug discovery with physical simulation and data-driven modeling. According to the lock-and-key model [tripathi2017molecular], molecules that bind tighter to a disease target are more likely to exhibit bioactivity against the disease, which has been verified experimentally [alon2021structures]. As AlphaFold2 has provided accurate structure predictions for most human proteins [jumper2021highly, varadi2022alphafold], SBDD has a tremendous opportunity to discover new drugs for targets that we could not model before [ren2022alphafold].

SBDD can be formulated as an optimization problem where the objective function is the binding affinity estimated by simulations such as docking [tripathi2017molecular]. The most widely used design method is virtual screening, which exhaustively evaluates and ranks every molecule in a library. Lyu et al. successfully discovered new chemotypes for AmpC β-lactamase and the D4 dopamine receptor by studying hundreds of millions of molecules with docking simulation [lyu2019ultra]. However, the number of drug-like molecules is large, estimated to be on the order of 10^60 [bohacek1996art], and it is computationally prohibitive to screen all of the possible molecules. Though machine learning approaches have been developed to accelerate screening [graff2021accelerating, gentile2022artificial], it remains challenging to screen a large enough chemical space within the foreseeable future.

Instead of naively screening a library, designing drug candidates with generative models has been highlighted as a promising strategy, exemplified by [luo20213d, li2021structure]. This class of methods models the problem as the generation of ligands conditioned on the protein pockets. However, as generative models are trained to learn the distribution of known active compounds, they tend to produce molecules similar to training data [walters2020assessing], which discourages finding novel molecules and leads to unsatisfactory optimization performance.

A more straightforward solution is a combinatorial optimization algorithm that searches the implicitly defined discrete chemical space. As shown in multiple standard molecule optimization benchmarks [brown2019guacamol, huang2021therapeutics, gao2022sample], combinatorial optimization methods, especially genetic algorithms (GA) [jensen2019graph, spiegel2020autogrow4], often perform better than deep generative models. The key to superior performance is GA’s action definition. Specifically, in each generation (iteration), GA maintains a population of possible candidates (a.k.a. parents) and conducts the crossover between two candidates and mutation from a single candidate to generate new offspring. These two types of actions, crossover and mutation, enable global and local traversal over the chemical space, allowing a thorough exploration and superior optimization performance.

However, most GAs select mutation and crossover operations randomly [jensen2019graph], leading to significant variance between independent runs. Especially in SBDD, where the oracle functions are expensive molecular simulations, it is resource-consuming to ensure stability by running the algorithm multiple times. Further, most current combinatorial methods are designed for general-purpose molecular optimization and simply use docking simulation as an oracle. It is challenging to leverage the structure of proteins in these methods, and we must start from scratch whenever we change the protein target, even though the physics of ligand–protein interaction is shared. Ignoring this shared information across tasks leads to unnecessary exploration steps and thus demands many more oracle calls, requiring expensive and unnecessary simulations [tripp2021fresh].

To overcome these issues in the GA method, we propose Reinforced Genetic Algorithm (RGA), which reformulates an evolutionary process as a Markov decision process and uses neural networks to make informed decisions and suppress random-walk behavior. Specifically, we utilize an E(3)-equivariant neural network [satorras2021n] to choose parents and mutation types based on the 3D structures of the ligands and proteins. The networks are pre-trained on various native complex structures to utilize the knowledge of the binding physics shared between different targets and then fine-tuned with a reinforcement learning algorithm during optimization. We test RGA's performance on various disease-related targets, including the main protease of SARS-CoV-2.

The main contributions of this paper can be summarized as follows:


  • We propose an evolutionary Markov decision process (EMDP) that reformulates an evolutionary process as a Markov decision process, where the state is a population of molecules instead of a single molecule (Section 3.2).

  • We show the first successful attempt to use a neural model to guide the crossover and mutation operations in a genetic algorithm to suppress random-walk behavior and explore the chemical space intelligently (Section 3.3).

  • We present a structure-based de novo drug design algorithm that outperforms baseline methods consistently through thorough empirical studies on optimizing binding affinity by leveraging the underlying binding physics (Section 4).

2 Related Works

We discuss related work on drug design methods and the advantages of the proposed method over existing approaches.

General Molecular Design. Molecular generation methods offer a promising direction for the automated design of molecules with desired pharmaceutical properties such as synthetic accessibility and drug-likeness. Based on how molecules are generated or searched, these approaches fall into two categories: (1) deep generative models (DGMs), which imitate the molecular data distribution, including variational autoencoders (VAE) [gomez2018automatic, jin2018junction], generative adversarial networks (GAN) [guimaraes2017objective, cao2018molgan], normalizing flow models [shi2019graphaf, luo2021graphdf], and energy-based models [liu2021graphebm]; and (2) combinatorial optimization methods, which directly search over the discrete chemical space, including genetic algorithms (GA) [jensen2019graph, nigam2019augmenting, gao2021amortized], reinforcement learning approaches (RL) [Olivecrona, You2018-xh, zhou2019optimization, jin2020multi, ahn2020guiding], Bayesian optimization (BO) [korovina2020chembo], Markov chain Monte Carlo (MCMC) [fu2021mimosa, xie2021mars, bengio2021gflownet], and gradient ascent [fu2021differentiable, shen2021deep].

General molecular design algorithms often use generic black-box oracle functions, and some are only tested with trivial or self-designed oracles. For example, the penalized octanol–water partition coefficient (LogP), a common test oracle, grows monotonically with the number of carbon atoms, so a trivial policy can optimize it. Such oracles do not reflect the challenges of real drug discovery, and algorithms validated only on them have limited value for pharmaceutical discovery. Recent works optimize docking scores to simulate a more realistic discovery scenario [cieplinski2020we, steinmann2021using, tripp2021fresh, yang2021hit], as we do. However, they still ignore the information in the given protein structures that could accelerate the design process, and extending them to leverage structural knowledge is nontrivial.

Structure-based Drug Design. Structure-based drug design (SBDD) utilizes structural information to guide the design of molecules, which is potentially more efficient in drug discovery tasks but poses the additional challenge of how to leverage the structures. Since the early 1990s, various SBDD algorithms have been proposed, mostly based on combinatorial optimization algorithms such as tree search [luo1996rasse, gillet1993sprout, pearlman1993concepts] and evolutionary algorithms [douguet2000genetic, durrant2013autogrow]. These methods typically optimize the ligands in the pockets according to a physical model characterizing the binding affinity. For example, RASSE [luo1996rasse] used a force-field-like scoring function [wang1998score] to evaluate the partial solutions within a tree search. However, obtaining a fast and accurate model of binding free energy is itself an unsolved challenge.

Recently, generative modeling of 3D molecules conditioned on protein targets has attracted increasing attention [luo20213d, li2021structure]. Similar to DGMs in general molecular design, these methods use neural models to learn the compositional and spatial distribution of atoms in native protein–ligand complexes and design new ligands by completing the complex structure for a given target. Deep generative models are end-to-end and data-driven, bypassing the need to understand the physics of the interaction. However, as the training objective is to learn the distribution of known active compounds, the models tend to produce molecules close to the training set [walters2020assessing], which is undesirable for patentability and leads to unsatisfactory optimization performance.

3 Method

In this paper, we focus on structure-based drug design: the goal is to design drug molecules (a.k.a. ligands) that bind tightly to disease-related proteins (a.k.a. targets). Given the 3D structure of a target protein, including its binding site, docking is a popular computational method for assessing binding affinity, which can be roughly interpreted as the free energy change of the binding process. We present a variant of the genetic algorithm that is guided by reinforcement learning and a docking oracle. Next, we first describe the general evolutionary process used in genetic algorithms (Section 3.1); then we present how to model this evolutionary process as a Markov decision process (MDP) on which an RL framework can be constructed (Section 3.2); after that, we describe the detailed implementation of this MDP framework using multiple policy networks (Section 3.3).

3.1 Evolutionary Process

In this section, we introduce the primary setting of the evolutionary process. With both optimization performance and synthetic accessibility taken into account [gao2020synthesizability, huang2021therapeutics], we follow the action settings in Autogrow 4.0 [spiegel2020autogrow4]. It demonstrated superior performance over other GA variants in an empirical validation of structure-based drug design [spiegel2020autogrow4], and its mutation actions originate from chemical reactions, so the designed molecules are more likely to be synthesizable. Specifically, an evolutionary process starts by randomly sampling a population of drug candidates from a library. In each generation (iteration), it carries out (i) crossover between parents selected from the last generation, and (ii) mutation of a single child to obtain the offspring pool. An illustration of both crossover and mutation operations is available in the Appendix. Note that we only adopt the action settings from Autogrow 4.0, without other tricks such as elitism.

Crossover, also called recombination, combines the structure of two parents to generate new children. Following Autogrow 4.0 [spiegel2020autogrow4], we select two parents from the last generation and search for the largest common substructure shared between them. Then we generate two children by randomly switching their decorating moieties, i.e., the side chains attached to the common substructure.
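To make the substructure-matching step concrete, below is a minimal sketch of finding the largest common substructure with RDKit's FMCS implementation. The SMILES strings and the helper name `largest_common_substructure` are our own illustrative choices, not taken from the RGA codebase; swapping the decorating side chains between parents is omitted here.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

def largest_common_substructure(parent1: Chem.Mol, parent2: Chem.Mol) -> Chem.Mol:
    """Find the largest common substructure (MCS) shared by two parent molecules."""
    result = rdFMCS.FindMCS(
        [parent1, parent2],
        atomCompare=rdFMCS.AtomCompare.CompareElements,
        bondCompare=rdFMCS.BondCompare.CompareOrder,
        timeout=10,  # seconds; MCS search is NP-hard, so cap the runtime
    )
    return Chem.MolFromSmarts(result.smartsString)

# Two hypothetical parent ligands that share a benzene scaffold.
p1 = Chem.MolFromSmiles("c1ccccc1CCO")   # phenethyl alcohol
p2 = Chem.MolFromSmiles("c1ccccc1CN")    # benzylamine
core = largest_common_substructure(p1, p2)
print(Chem.MolToSmarts(core))
# The decorating moieties (side chains attached to this core) would then be
# swapped between the parents to produce two children.
```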

Mutation operates on a single parent molecule and modifies its structure slightly. Following Autogrow 4.0 [spiegel2020autogrow4], we adopt transformations based on chemical reactions. Unlike naively defined atom-editing actions, reaction-based mutation steps ensure that every modification is chemically reasonable, increasing the probability of designing synthesizable molecules. We include two types of chemical reactions: uni-molecular reactions, which require only one reactant, and bi-molecular reactions, which require two. While uni-molecular reactions can be applied directly to the parent, for a bi-molecular reaction we sample a purchasable compound to react with the parent. In both cases, the parent serves as one reactant, and we use the main product as the child molecule. We use the chemical reactions from [spiegel2020autogrow4], originally collected from [durrant2013autogrow, hartenfeller2011collection].
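As an illustration of a reaction-based mutation, the sketch below applies a bi-molecular amide-coupling template with RDKit. The specific SMARTS template is a textbook example chosen for clarity; it is not necessarily one of the reactions in the libraries used by RGA.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Amide coupling: carboxylic acid + primary amine -> amide (illustrative template).
rxn = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OX2H1].[NX3;H2:3]>>[C:1](=[O:2])[N:3]"
)

parent = Chem.MolFromSmiles("CC(=O)O")      # acetic acid: the parent ligand
partner = Chem.MolFromSmiles("NCc1ccccc1")  # benzylamine: a purchasable reactant

products = rxn.RunReactants((parent, partner))
child = products[0][0]          # take the main product as the child molecule
Chem.SanitizeMol(child)
print(Chem.MolToSmiles(child))  # CC(=O)NCc1ccccc1 (N-benzylacetamide)
```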

Evolution. At the $t$-th generation (iteration), given a population of molecules denoted as $\mathcal{S}^{(t)}$, we generate an offspring pool denoted as $\mathcal{O}^{(t)}$ by applying crossover and mutation operations. Then we filter out the candidates with undesirable physical and chemical properties (e.g., poor solubility, high toxicity) in the offspring pool and select the most promising $N$ to form the next generation pool $\mathcal{S}^{(t+1)}$.
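A minimal sketch of one such generation follows, assuming `crossover`, `mutate`, `passes_filters`, and `vina_score` are provided elsewhere (all names are ours for illustration). In RGA, the two `random` selections below are replaced by policy-network choices (Section 3.3).

```python
import random

def one_generation(population, vina_score, crossover, mutate,
                   passes_filters, n_offspring=100, pop_size=50):
    """Run one GA generation: produce offspring, filter, keep the best N."""
    offspring = []
    for _ in range(n_offspring):
        p1, p2 = random.sample(population, 2)   # random parent selection
        offspring.extend(crossover(p1, p2))     # two children per crossover
    for _ in range(n_offspring):
        parent = random.choice(population)
        offspring.append(mutate(parent))        # reaction-based mutation
    # Drop candidates with undesirable physicochemical properties.
    offspring = [m for m in offspring if m is not None and passes_filters(m)]
    # Keep the N most promising candidates (lowest Vina score = tightest binding).
    return sorted(offspring, key=vina_score)[:pop_size]
```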

Figure 1: We illustrate one generation (iteration) of GA (top) and RGA pipeline (bottom). Specifically, we train policy networks that take the target and ligand as input to make informed choices on parents and mutation types in RGA.

3.2 Evolutionary Markov Decision Process

Next we propose the evolutionary Markov decision process (EMDP), which formulates the evolutionary process of a genetic algorithm as a Markov decision process (MDP). The primary purpose is to utilize reinforcement learning algorithms to train networks that inform the decision steps, replacing random selections. Taking a generation as a state, the Markov property $p(\mathcal{S}^{(t+1)} \mid \mathcal{S}^{(t)}, \dots, \mathcal{S}^{(0)}) = p(\mathcal{S}^{(t+1)} \mid \mathcal{S}^{(t)})$ is naturally satisfied by the evolutionary process described above, where $\mathcal{S}^{(t)}$ denotes the state at the $t$-th generation, i.e., the population of ligands. We use $X$ to denote a ligand. We elaborate the essential components of the Markov decision process as follows; the EMDP pipeline is illustrated in Figure 1.

State Space. We define the population at the $t$-th generation, $\mathcal{S}^{(t)}$, in the evolutionary process as the state at step $t$ in the EMDP. A state includes the population of candidate molecules (i.e., ligands, denoted $X$) and their 3D poses docked to the target, and is fully observable to the RL agent. At the beginning of the EMDP, we randomly select a population of candidate molecules and use docking simulation to obtain their 3D poses as the initial state.

Action Space. The actions in an EMDP carry out the two evolutionary steps, crossover and mutation, on a population; each evolutionary step requires two actions. Concretely, crossover can be divided into two steps: (1) select the first parent ligand $X_{c_1}$ from the current state (population $\mathcal{S}^{(t)}$); (2) conditioned on the first selected candidate $X_{c_1}$, select the second parent ligand $X_{c_2}$ from the remaining candidate set $\mathcal{S}^{(t)} \setminus \{X_{c_1}\}$ and apply crossover (Section 3.1) to them.

Mutation can also be divided into two steps: (1) select the candidate ligand $X_m$ to be mutated from the current state (population $\mathcal{S}^{(t)}$); (2) conditioned on the selected candidate ligand $X_m$, select a reaction $r$ from the reaction set $\mathcal{R}$ and apply it to $X_m$.

As applying the crossover and mutation steps is deterministic, the actions in an EMDP reduce to selecting parents and mutation types. Upon finishing the actions, we obtain the offspring pool $\mathcal{O}^{(t)}$.

State Transition Dynamics. The state transition in an EMDP is identical to the evolution step in an evolutionary process. Once we finish the actions and obtain the offspring pool $\mathcal{O}^{(t)}$, we apply molecular quality filters to remove candidates unlikely to be drugs, and then select the most promising $N$ to form the parent set for the next generation, $\mathcal{S}^{(t+1)}$.

Reward. We define the reward as the binding affinity change (docking score), so actions leading to a stronger binding score are prioritized. As there is no "episode" concept in an EMDP, we treat every step equally.

3.3 Target-Ligand Policy Network

To exploit the translational and rotational invariance of molecular structures, we adopt equivariant neural networks (ENNs) [satorras2021n] as the target–ligand policy networks that select the actions in both mutation and crossover steps. Each ligand has a 3D pose bound to the target protein, and the complex serves as the input to the ENN.

Specifically, we want to model a 3D graph $\mathcal{G}$, which can be a ligand, a target, or a target–ligand complex. The input features can be described as $(A, Z)$, where $A$ represents the atoms' categories (from the vocabulary set $\mathcal{V}$) and $Z$ represents the 3D coordinates of the atoms. Suppose $W$ is the embedding matrix of all atom categories in the vocabulary set $\mathcal{V}$; $W$ is randomly initialized and learnable, and each kind of atom corresponds to a row of $W$, whose width is the hidden dimension $d$ of the ENN. We suppose there are $n$ atoms, each corresponding to a node in the 3D graph. Node embeddings at the $l$-th layer are denoted $H^{(l)} = [h_1^{(l)}, \dots, h_n^{(l)}]$, where $l = 0, 1, \dots, L$ and $L$ is the number of layers in the ENN. The initial node embedding of the $i$-th node is $h_i^{(0)} = W^\top a_i$, where $a_i$ is the one-hot vector encoding the category of the $i$-th atom. Coordinate embeddings at the $l$-th layer are denoted $x_1^{(l)}, \dots, x_n^{(l)}$; the initial coordinate embeddings are the real 3D coordinates of the nodes. The following equations define the feedforward rules of the ENN for $l = 0, \dots, L-1$:

$$
\begin{aligned}
m_{ij}^{(l)} &= \mathrm{MLP}_e\big(h_i^{(l)} \oplus h_j^{(l)} \oplus \|x_i^{(l)} - x_j^{(l)}\|_2^2\big), \\
x_i^{(l+1)} &= x_i^{(l)} + \sum_{j \neq i} \big(x_i^{(l)} - x_j^{(l)}\big)\, \mathrm{MLP}_x\big(m_{ij}^{(l)}\big), \\
m_i^{(l)} &= \sum_{j \neq i} m_{ij}^{(l)}, \qquad
h_i^{(l+1)} = \mathrm{MLP}_h\big(h_i^{(l)} \oplus m_i^{(l)}\big),
\end{aligned} \qquad (1)
$$

where $\oplus$ denotes the concatenation of vectors, and $\mathrm{MLP}_e, \mathrm{MLP}_x, \mathrm{MLP}_h$ are all two-layer multilayer perceptrons (MLPs) with Swish activation in the hidden layer [ramachandran2017searching]. At the $l$-th layer, $m_{ij}^{(l)}$ represents the message vector for the edge from node $i$ to node $j$; $m_i^{(l)}$ represents the message vector for node $i$; $x_i^{(l)}$ is the position embedding for node $i$; and $h_i^{(l)}$ is the node embedding for node $i$. $h_1^{(L)}, \dots, h_n^{(L)}$ are the node embeddings of the $L$-th (last) layer. We aggregate them with a sum readout to obtain a representation of the 3D graph, denoted $h_{\mathcal{G}}$. The whole process is written as $h_{\mathcal{G}} = \mathrm{ENN}(\mathcal{G})$.
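A minimal PyTorch sketch of one such equivariant layer, following Equation (1), is shown below. The dimensions and class name are our choices; a production implementation would batch over graphs and restrict messages to actual edges rather than all atom pairs.

```python
import torch
import torch.nn as nn

class EGNNLayer(nn.Module):
    """One E(3)-equivariant layer implementing the updates in Equation (1)."""
    def __init__(self, d: int):
        super().__init__()
        swish = nn.SiLU()  # SiLU is the Swish activation
        self.mlp_e = nn.Sequential(nn.Linear(2 * d + 1, d), swish, nn.Linear(d, d))
        self.mlp_x = nn.Sequential(nn.Linear(d, d), swish, nn.Linear(d, 1))
        self.mlp_h = nn.Sequential(nn.Linear(2 * d, d), swish, nn.Linear(d, d))

    def forward(self, h: torch.Tensor, x: torch.Tensor):
        # h: (n, d) node embeddings; x: (n, 3) coordinate embeddings
        n = h.size(0)
        diff = x[:, None, :] - x[None, :, :]        # (n, n, 3): x_i - x_j
        dist2 = (diff ** 2).sum(-1, keepdim=True)   # (n, n, 1): squared distances
        h_i = h[:, None, :].expand(n, n, -1)
        h_j = h[None, :, :].expand(n, n, -1)
        m = self.mlp_e(torch.cat([h_i, h_j, dist2], dim=-1))    # edge messages
        mask = 1.0 - torch.eye(n, device=h.device)[..., None]   # exclude j == i
        x_new = x + (diff * self.mlp_x(m) * mask).sum(dim=1)    # equivariant update
        m_i = (m * mask).sum(dim=1)                              # node messages
        h_new = self.mlp_h(torch.cat([h, m_i], dim=-1))
        return h_new, x_new

# Graph readout: sum the last layer's node embeddings to obtain h_G.
```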

Crossover Policy Network. We design two policy networks for the two corresponding actions in crossover, as described in Section 3.2. (1) The first action is to select the first parent ligand from the population $\mathcal{S}^{(t)}$. Similar to the first action in the mutation operation, we obtain a valid probability distribution over all available ligands using the target–ligand complex as input and the ENN as the neural architecture; the selection probability of ligand $X$ is

$$p_{c_1}(X \mid \mathcal{S}^{(t)}) = \underset{X \in \mathcal{S}^{(t)}}{\mathrm{Softmax}}\, \mathrm{MLP}\big(\mathrm{ENN}(\mathcal{T} \oplus X)\big),$$

where $\mathcal{T}$ and $X$ denote the target and the ligand (including its 3D pose), respectively, and $\mathcal{T} \oplus X$ denotes the target–ligand complex. (2) The second action is to select the second parent ligand conditioned on the first parent $X_{c_1}$. Specifically, for each ligand $X'$ in the remaining population, we concatenate the ENN embeddings of the target, the first parent ligand $X_{c_1}$, and the candidate second parent $X'$, and feed the result into an MLP to estimate a scalar as an unnormalized probability. The unnormalized probabilities of all ligands in the remaining population are normalized via softmax, i.e., $p_{c_2}(X' \mid X_{c_1}, \mathcal{S}^{(t)})$. Given the two parent ligands, crossover finds the largest substructure that the two parents share and generates children by recombining their decorating moieties. Thus, the generation of the child ligands is deterministic, and the probability of the generated children $(X_{c_1}', X_{c_2}')$ is

$$p\big(X_{c_1}', X_{c_2}' \mid \mathcal{S}^{(t)}\big) = p_{c_1}(X_{c_1} \mid \mathcal{S}^{(t)})\; p_{c_2}(X_{c_2} \mid X_{c_1}, \mathcal{S}^{(t)}). \qquad (2)$$
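A sketch of the first-parent selection step under stated assumptions: `enn_embed` returns the ENN embedding of a target–ligand complex and `score_mlp` maps it to a scalar logit (both names are ours, not from the RGA codebase).

```python
import torch

def select_first_parent(complexes, enn_embed, score_mlp):
    """Sample the first crossover parent from a softmax over per-ligand logits.

    complexes: list of target-ligand complex graphs, one per ligand in the population.
    Returns the sampled index and its log-probability (kept for REINFORCE).
    """
    logits = torch.stack([score_mlp(enn_embed(c)) for c in complexes]).squeeze(-1)
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()
    return idx.item(), dist.log_prob(idx)
```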

Mutation Policy Network. We design two policy networks for the two corresponding actions in mutation, as described in Section 3.2. (1) The first action is to select the candidate ligand $X_m$ to be mutated from the population $\mathcal{S}^{(t)}$; it models the 3D target–ligand complex to learn whether the current complex leaves room for improvement. Formally, we obtain a valid probability distribution over all available ligands with the target–ligand complex as input and the ENN as the neural architecture; the selection probability of ligand $X$ is $p_{m_1}(X \mid \mathcal{S}^{(t)}) = \mathrm{Softmax}_{X \in \mathcal{S}^{(t)}}\, \mathrm{MLP}\big(\mathrm{ENN}(\mathcal{T} \oplus X)\big)$, where $\mathcal{T} \oplus X$ denotes the target–ligand complex. (2) The second action selects a SMARTS reaction $r$ from the reaction set $\mathcal{R}$ conditioned on the ligand selected in the first action. Specifically, for each reaction $r$ we generate the new ligand $X_m' = r(X_m)$, obtain the ENN embeddings of the target, the parent ligand, and the new ligand, concatenate these three embeddings, and feed them into an MLP to estimate a scalar as an unnormalized probability. The unnormalized probabilities of all reactions are normalized via softmax, i.e., $p_{m_2}(r \mid X_m) = \mathrm{Softmax}_{r \in \mathcal{R}}\, \mathrm{MLP}\big(\mathrm{ENN}(\mathcal{T}) \oplus \mathrm{ENN}(X_m) \oplus \mathrm{ENN}(r(X_m))\big)$, where $\mathcal{R}$ is the reaction set. The probability of the generated child ligand is

$$p\big(X_m' \mid \mathcal{S}^{(t)}\big) = p_{m_1}(X_m \mid \mathcal{S}^{(t)})\; p_{m_2}(r \mid X_m). \qquad (3)$$

Policy Gradient. We leverage policy gradient to train the target–ligand policy networks. Specifically, we maximize the expected reward via REINFORCE [Olivecrona], whose gradient estimator is

$$\nabla_\theta J(\theta) = \mathbb{E}\big[ R(X')\, \nabla_\theta \log p_\theta(X' \mid \mathcal{S}^{(t)}) \big], \qquad (4)$$

where $p_\theta(X' \mid \mathcal{S}^{(t)})$ is defined in Equations (2) and (3) for crossover and mutation, respectively, and $R(\cdot)$ is the reward of the generated ligand. The whole pipeline is illustrated in Figure 1. To provide a warm start and leverage structural information, we pretrain the ENNs on a 3D target–ligand binding affinity prediction task, where the inputs are target–ligand complexes and the outputs are their binding affinities.
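A minimal REINFORCE step consistent with Equation (4), assuming `log_prob` is the summed log-probability of the selections that produced a child and `reward` is its negative Vina score (the optimizer setup is ours):

```python
import torch

def reinforce_step(log_prob: torch.Tensor, reward: float,
                   optimizer: torch.optim.Optimizer) -> None:
    """One policy-gradient update: ascend E[R * log p] by descending its negation."""
    loss = -reward * log_prob   # REINFORCE surrogate loss for a single sample
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```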

4 Experiment

In this section, we briefly describe the experimental setup and results. The Appendix includes more details, including software configuration, implementation details, dataset description and processing, hyperparameter tuning, ablation study, and additional experimental results. The code is available at https://github.com/futianfan/reinforced-genetic-algorithm.

4.1 Experimental Setup

Docking Simulation. We adopt AutoDock Vina [trott2010autodock] to evaluate binding affinity. The docking score estimated by AutoDock Vina, called the Vina score, roughly characterizes the free energy change of the binding process in kcal/mol; thus a lower Vina score means a stronger binding affinity between the ligand and the target. We picked various disease-related proteins, including G-protein coupled receptors (GPCRs) and kinases from DUD-E [mysinger2012directory], and the SARS-CoV-2 main protease [zhang2021potent] as targets. Please see the Appendix for more information.
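For reference, here is a hedged sketch of how a single Vina score might be obtained by shelling out to the AutoDock Vina binary. The paths, box parameters, and output parsing are placeholders; RGA's exact invocation may differ.

```python
import re
import subprocess

def vina_score(receptor_pdbqt: str, ligand_pdbqt: str,
               center=(0.0, 0.0, 0.0), size=(15.0, 15.0, 15.0)) -> float:
    """Dock one ligand and return the best (lowest) Vina score in kcal/mol."""
    cmd = [
        "vina",
        "--receptor", receptor_pdbqt,
        "--ligand", ligand_pdbqt,
        "--center_x", str(center[0]), "--center_y", str(center[1]),
        "--center_z", str(center[2]),
        "--size_x", str(size[0]), "--size_y", str(size[1]),
        "--size_z", str(size[2]),
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # Vina prints a results table "mode | affinity (kcal/mol) | ..."; grab the
    # affinity column of each row and return the best (most negative) value.
    scores = [float(m) for m in re.findall(r"^\s*\d+\s+(-\d+\.\d+)", out, re.M)]
    return min(scores)
```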

Baselines. The baseline methods cover traditional brute-force search (Screening), deep generative models (JTVAE and Gen3D), genetic algorithms (GA+D, Graph-GA, Autogrow 4.0), reinforcement learning methods (MolDQN, RationaleRL, REINVENT, GEGL), and an MCMC method (MARS). Gen3D and Autogrow 4.0 are structure-based drug design methods, while the others are general-purpose molecular design methods. Although methods that explicitly utilize target structures are relatively few, we include general-purpose molecular design methods optimizing the same docking oracle as ours, a common use case, as baselines [jensen2019graph, huang2021therapeutics]. Concretely, Screening mimics high-throughput screening by sampling randomly from the ZINC database; JTVAE (Junction Tree Variational Auto-Encoder) [jin2018junction] uses Bayesian optimization on the latent space to indirectly optimize molecules; Gen3D [luo20213d] is an auto-regressive generative model that grows 3D structures atom-wise inside the binding pocket; GA+D [nigam2019augmenting] represents molecules as SELFIES strings [krenn2020self] and uses a genetic algorithm enhanced by a discriminator neural network; Graph-GA [jensen2019graph] conducts a genetic algorithm on the molecular graph representation; Autogrow 4.0 [spiegel2020autogrow4] is the state-of-the-art genetic algorithm in structure-based drug design; MolDQN (Molecule Deep Q-Network) [zhou2019optimization] leverages deep Q-learning to grow molecules atom-wise; RationaleRL [jin2020multi] uses rationales (e.g., functional groups or subgraphs) as building blocks and a policy gradient method to guide the training of a graph neural network-based generator; REINVENT [Olivecrona] represents molecules as SMILES strings and uses policy-gradient reinforcement learning to guide the training of an RNN generator; GEGL (genetic expert-guided learning) [ahn2020guiding] uses an LSTM guided by reinforcement learning to imitate GA exploration; MARS (Markov Molecule Sampling) [xie2021mars] leverages Markov chain Monte Carlo (MCMC) sampling with an adaptive proposal and an annealing scheme to search chemical space. For a fair comparison, we limit the number of oracle calls to 1,000 for each method. All baselines can be run with one line of code using the software (https://github.com/wenhao-gao/mol_opt) from the practical molecular optimization benchmark [gao2022sample].

Dataset. We randomly select molecules from the ZINC [sterling2015zinc] database (around 250 thousand drug-like molecules) as the 0-th generation of the genetic algorithms (RGA, Autogrow 4.0, GA+D). ZINC also serves as the training data for pretraining the models in JTVAE, REINVENT, RationaleRL, etc. We adopt the CrossDocked2020 [francoeur2020three] dataset, which contains around 22 million ligand–protein complexes, as the training data for pretraining the policy networks, as mentioned in Section 3.3. More details are available in the Appendix.

Metrics. The selection of evaluation metrics follows recent works in molecule optimization [jin2018junction, nigam2019augmenting, jin2020multi, xie2021mars] and structure-based drug design [spiegel2020autogrow4, luo20213d, huang2021therapeutics]. For each method, we select the top-100 molecules with the best docking scores and consider the following metrics: TOP-1/10/100 (average docking score of the top-1/10/100 molecules): docking score directly measures the binding affinity between ligand and target and is the most informative metric in structure-based drug design; Novelty (Nov): the percentage of generated molecules not in the training set; Diversity (Div): the average pairwise Tanimoto distance between Morgan fingerprints. We also evaluate some simple pharmaceutical properties, including quantitative estimate of drug-likeness (QED) and synthetic accessibility (SA). The QED score indicates drug-likeness and ranges from 0 to 1 (higher is better). The SA score ranges from 1 to 10 (lower is better). All evaluation functions are available in the Therapeutics Data Commons (TDC, https://tdcommons.ai/fct_overview) [huang2021therapeutics, Huang2022artificial].

4.2 Results

Method | TOP-100 (↓) | TOP-10 (↓) | TOP-1 (↓) | Nov (↑) | Div (↑) | QED (↑) | SA (↓)
Screening | -9.351±0.643 | -10.433±0.563 | -11.400±0.630 | 0.0±0.0% | 0.858±0.005 | 0.678±0.022 | 2.689±0.077
MARS | -7.758±0.612 | -8.875±0.711 | -9.257±0.791 | 100.0±0.0% | 0.877±0.001 | 0.709±0.008 | 2.450±0.034
MolDQN | -6.287±0.396 | -7.043±0.487 | -7.501±0.402 | 100.0±0.0% | 0.877±0.009 | 0.170±0.024 | 5.833±0.182
GEGL | -9.064±0.920 | -9.91±0.990 | -10.45±1.040 | 100.0±0.0% | 0.853±0.003 | 0.643±0.014 | 2.99±0.054
REINVENT | -10.181±0.441 | -11.234±0.632 | -12.010±0.833 | 100.0±0.0% | 0.857±0.011 | 0.445±0.058 | 2.596±0.116
RationaleRL | -9.233±0.920 | -10.834±0.856 | -11.642±1.102 | 100.0±0.0% | 0.717±0.025 | 0.315±0.023 | 2.919±0.126
JTVAE | -9.291±0.702 | -10.242±0.839 | -10.963±1.133 | 98.0±0.027% | 0.867±0.001 | 0.593±0.035 | 3.222±0.136
Gen3D | -8.686±0.450 | -9.285±0.584 | -9.832±0.324 | 100.0±0.0% | 0.870±0.006 | 0.701±0.016 | 3.450±0.120
GA+D | -7.487±0.757 | -8.305±0.803 | -8.760±0.796 | 99.2±0.011% | 0.834±0.035 | 0.405±0.024 | 5.024±0.164
Graph-GA | -10.848±0.860 | -11.702±0.930 | -12.302±1.010 | 100.0±0.0% | 0.811±0.037 | 0.456±0.067 | 3.503±0.367
Autogrow 4.0 | -11.371±0.398 | -12.213±0.623 | -12.474±0.839 | 100.0±0.0% | 0.852±0.011 | 0.748±0.022 | 2.497±0.049
RGA (ours) | -11.867±0.170 | -12.564±0.287 | -12.869±0.473 | 100.0±0.0% | 0.857±0.020 | 0.742±0.036 | 2.473±0.048
RGA - pretrain | -11.443±0.219 | -12.424±0.386 | -12.435±0.654 | 100.0±0.0% | 0.854±0.035 | 0.750±0.034 | 2.494±0.043
RGA - KT | -11.434±0.169 | -12.437±0.354 | -12.502±0.603 | 100.0±0.0% | 0.853±0.028 | 0.738±0.034 | 2.501±0.050
Table 1: The summarized performance of different methods. The mean and standard deviation across targets are reported. Arrows (↓, ↑) indicate the direction of better performance. RGA-pretrain and RGA-KT are two variants of RGA without pretraining and without training across different target proteins, respectively.

Stronger Optimization Performance. We summarize the main results of structure-based drug design in Table 1. We evaluate all methods on all targets and report each metric's mean and standard deviation across targets. RGA achieves the best performance on TOP-100/10/1 scores among all compared methods. Compared to Autogrow 4.0, RGA's better docking scores demonstrate that the policy networks contribute positively to chemical space navigation and ultimately help discover more potent binding molecules. On the other hand, including the longer-range navigation steps enabled by crossover leads to superior performance over other RL methods (REINVENT, MolDQN, GEGL and RationaleRL) that only make local modifications. In addition, we also observe competitive structure quality, measured by QED and SA score, in Autogrow 4.0 and RGA without involving these properties as optimization objectives, thanks to the mutation steps originating from chemical reactions. We visualize two designed ligands with optimal affinity for closer inspection in Figures 2(a) and 2(b) and find that both ligands bind tightly to their targets.

Suppressed Random-Walk Behavior. Especially in SBDD, where the oracle functions are expensive molecular simulations, robustness to random seeds is essential for the worst-case performance of an algorithm. A major issue in traditional GAs is the significant variance between independent runs caused by randomly selecting parents and mutation types. To examine this behavior, we perform five independent runs of RGA, Autogrow 4.0 and Graph-GA (the three strongest methods, all GA-based) on all targets and plot the standard deviations between runs in Figures 2(c) and 2(d). With policy networks guiding the action steps, the random-walk behavior of Autogrow 4.0 is suppressed in RGA, as indicated by the smaller variance. Especially in the later learning phase (after 500 oracle calls), the policy networks have been fine-tuned and guide the search more intelligently. This advantage leads to better worst-case performance and a higher probability of successfully identifying bioactive drug candidates under constrained resources.

Knowledge Transfer Between Protein Targets. To verify whether RGA benefits from learning the shared physics of ligand–target interaction, we conducted an ablation study whose results are in the last two rows of Table 1. Specifically, we compare RGA with two variants: (1) RGA-pretrain, which does not pretrain the policy networks on the native complex structures in CrossDocked2020; and (2) RGA-KT (knowledge transfer), which fine-tunes the networks on data from each target independently. We find that both strategies contribute positively to RGA's TOP-100/10/1 docking scores. These results demonstrate that the policy networks successfully learn the shared physics of ligand–target interactions and leverage this knowledge to improve performance.

(a) Example of 7l11, -10.8 kcal/mol
(b) Example of 3eml, -13.2 kcal/mol
(c) TOP-100 vs # oracle call
(d) Score bar for different runs
Figure 2: (a) and (b): Examples of ligand poses (generated by RGA) and binding sites of target structures. Example for 7l11: the PDB ID of the target is 7l11, the SARS-CoV-2 (2019-nCoV) main protease; the Vina score is -10.8 kcal/mol. Example for 3eml: the PDB ID is 3eml, a human A2A adenosine receptor; the Vina score is -13.2 kcal/mol. (c) and (d): studies of suppressed random-walk behavior. (c) reports the TOP-100 docking score as a function of the number of oracle calls; the results are means and standard deviations of 5 independent runs. (d) shows bars of the TOP-100 docking score for the independent runs.

5 Conclusion

In this paper, we propose the Reinforced Genetic Algorithm (RGA) to tackle structure-based drug design. RGA reformulates the evolutionary process in genetic algorithms as a Markov decision process, called an evolutionary Markov decision process (EMDP), so that the search process can benefit from trained neural models. Specifically, we train policy networks to choose the parents for crossover and mutation instead of sampling them randomly. Further, we leverage the common physics of ligand–target interaction and adopt a knowledge-transfer strategy that uses data from other targets to train the networks. Through empirical study, we show that RGA has strong and robust optimization performance, consistently outperforming baseline methods in terms of docking score.

Though we adopt mutations originating from chemical reactions and the structural quality metrics look good, we must emphasize that the molecules designed by RGA carry no guarantee of synthesizability [gao2020synthesizability], as the crossover operation may break the synthesizability inherited from the parents. Directly working on synthetic pathways could solve this problem [gao2021amortized, bradshaw2020barking], but the extension is not trivial. As future work, we plan to theoretically analyze the EMDP formulation and the performance of RGA.

Acknowledgement

T.F. and J.S. were supported by NSF award SCH-2205289, SCH-2014438, IIS-1838042, NIH award R01 1R01NS107291-01. W.G. and C.C. were supported by the Office of Naval Research under grant number N00014-21-1-2195 and the Machine Learning for Pharmaceutical Discovery and Synthesis consortium. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Office of Naval Research.

References

Appendix A Mathematical Notation

For ease of exposition, we list the mathematical notations in Table 2. The notations are divided into three parts: (1) notation for the genetic algorithm (Section 3.1); (2) notation for the equivariant neural network (ENN) [satorras2021n] (Section 3.3); (3) notation for the policy networks (Section 3.3).

Notations | Descriptions
$X$ | ligand (drug molecule, including 3D pose)
$\mathcal{T}$ | target (target protein related to the disease)
$\mathcal{S}^{(t)}$ | the state (population of molecules) at the $t$-th generation
$\mathcal{O}^{(t)}$ | offspring pool at the $t$-th generation
$N$ | the number of molecules in the state, i.e., size of population
$X_{c_1}$ / $X_{c_2}$ | the first/second parent molecule in the crossover
$X_{c_1}'$ / $X_{c_2}'$ | the first/second child molecule in the crossover
$X_m$ | parent molecule in the mutation
$X_m'$ | child molecule in the mutation
$r$ | the selected reaction in the mutation
$\mathcal{R}$ | the reaction set (library) for mutation
ENN | equivariant neural network [satorras2021n]
$\mathcal{V}$ | vocabulary set of atoms
$\mathcal{G}$ | 3D structure (graph)
$A$ | categories of all the atoms
$a_i$ | one-hot vector that encodes the category of the $i$-th atom
$Z$ | 3D coordinates of the atoms
$W$ | the embedding matrix of all the categories of atoms
$d$ | the hidden dimension in the ENN
$n$ | number of atoms in the input of the ENN
$L$ | number of layers in the ENN
$l$ | index of a layer in the ENN
MLP | multilayer perceptron
$\mathrm{MLP}_e, \mathrm{MLP}_x, \mathrm{MLP}_h$ | two-layer MLPs in the ENN with Swish activation [ramachandran2017searching] in the hidden layer
$\oplus$ | the concatenation of vectors
$x_i^{(0)}$ | initial coordinate embeddings: the real 3D coordinates of the nodes
$H^{(l)}$ | node embeddings at the $l$-th layer
$h_i^{(0)}$ | the initial node embedding of the $i$-th node
$x_1^{(l)}, \dots, x_n^{(l)}$ | coordinate embeddings at the $l$-th layer
$m_{ij}^{(l)}$ | message vector for the edge from node $i$ to node $j$ at the $l$-th layer
$m_i^{(l)}$ | message vector for node $i$ at the $l$-th layer
$x_i^{(l)}$ | the position embedding for node $i$ at the $l$-th layer
$h_i^{(l)}$ | the node embedding for node $i$ at the $l$-th layer
$h_{\mathcal{G}}$ | ENN representation of the 3D graph (Equation 1)
$p_{c_1}$ | probability of selecting the first parent molecule in crossover
$p_{c_2}$ | probability of selecting the second parent molecule in crossover
$p(X_{c_1}', X_{c_2}' \mid \mathcal{S}^{(t)})$ | probability of the two generated child molecules in crossover (Eq. 2)
$p_{m_1}$ | probability of selecting the parent molecule in mutation
$p_{m_2}$ | probability of selecting the reaction in mutation
$p(X_m' \mid \mathcal{S}^{(t)})$ | probability of the generated child molecule in mutation (Eq. 3)
Table 2: Mathematical notations, divided into three parts: (1) notation for the genetic algorithm (Section 3.1); (2) notation for the equivariant neural network (ENN) [satorras2021n] (Section 3.3); (3) notation for the policy networks (Section 3.3).

Appendix B Illustration of Genetic Algorithm

Figure 3 provides two examples to illustrate crossover and mutation operations in genetic algorithm described in Section 3.1.

Crossover, also called recombination, combines the structure of two parents to generate new children. Following Autogrow 4.0 [spiegel2020autogrow4], as shown in Figure 3(a), we select two parents from the last generation and search for the largest common substructure shared between them. Then we generate two children by randomly switching their decorating moieties, i.e., the side chains attached to the common substructure.

The mutation operator performs an in silico chemical reaction to generate an altered child compound (i.e., the product of the reaction) from a parent (a reactant of the reaction), as shown in Figure 3(b). The chemical reaction contains two reactants: one is the parent molecule, and the other comes from the reaction set. The reaction set is generated by merging two public reaction libraries: (1) the AutoClickChemRxn set (36 reactions) [durrant2013autogrow] and (2) the RobustRxn set (58 reactions) [hartenfeller2011collection]. Each reaction in the set contains a SMARTS-based reaction template and a reactant, and we use the SMARTS template together with RDKit [landrum2006rdkit] to perform chemical mutations efficiently. The process is written as $X_m' = r(X_m)$, where $r$ is the selected reaction (with a reaction template and, for bi-molecular reactions, another reactant). The ligand to be mutated and the reaction used for mutation are both randomly selected, from the previous generation and from the reaction set $\mathcal{R}$, respectively. Compared with the mutation operator in a conventional GA that randomly flips an arbitrary bit, reaction-based mutation enhances the synthesizability of the generated molecules [spiegel2020autogrow4].

(a) crossover
(b) mutation
Figure 3: Illustration of GA operations. (a) Crossover finds the largest substructure that the two parent compounds share and generates children by recombining their decorating moieties. (b) Mutation: given a reactant (i.e., parent), the mutation operator uses a SMARTS reaction template (with another reactant, for bi-molecular reactions) to perform an in silico chemical reaction that generates the child compound (i.e., product).

Appendix C Baseline Setup

In this section, we describe the experimental setting for baseline methods. Most of the settings follow the original papers.

  • GA+D (genetic algorithm enhanced by a discriminator neural network) [nigam2019augmenting] utilizes the SELFIES string representation of molecules, guaranteeing 100% chemical validity of the generated molecules. Following the original paper, the discriminator is a two-layer fully connected neural network with ReLU activation and a sigmoid output layer; the hidden size is 100 and the output size is 1. The input features of the discriminator are a vector of chemical and geometrical properties characterizing the molecules. The population size is set to 300 and the maximal number of generations to 1000. The patience is set to 5: when the property does not improve before the patience is exhausted, the process stops early. We use the Adam optimizer with an initial learning rate of 1e-3. The hyperparameter β weighs the discriminator's score in the fitness evaluation, which is used to select the most promising molecules in each generation.

  • Graph-GA [jensen2019graph] uses a molecular graph representation and edits the graph with crossover and mutation operations. After tuning, the population size is set to 120, the offspring size to 70, and the mutation rate to 0.067. Graph-GA has no learnable parameters and is easy to implement; we use the implementation in GuacaMol [brown2019guacamol].

  • MolDQN (Molecule Deep Q-Network) [zhou2019optimization] uses a molecular graph representation, formulates molecule optimization as a Markov decision process (MDP), and applies deep Q-learning to optimize it. It grows the molecular graph atom-wise: in each episode, it adds one atom to the partially generated graph. The reward function is the negative Vina score. Following the original paper, the maximal number of steps per episode is 40, and each step calls the oracle once. The discount factor is 0.9. The deep Q-network is a multilayer perceptron (MLP) with hidden dimensions 1024, 512, 128 and 32; the model size is 6.4M. The input of the Q-network is the concatenation of the molecule feature (a 2048-bit Morgan fingerprint with radius 3) and the number of steps left. Adam is used as the optimizer with an initial learning rate of 1e-4. Only rings of size 5 and 6 are allowed. It leverages ε-greedy exploration together with randomized value functions (bootstrapped DQN), with ε annealed from 1 to 0.01 in a piecewise linear way.

  • RationaleRL [jin2020multi]. The architecture of the generator is a message-passing network (MPN) followed by MLPs applied in breadth-first order. The generator is pre-trained on general molecules together with an encoder and then fine-tuned to maximize the reward function via policy gradient. The encoder and decoder MPNs both have hidden dimension 400; the dimension of the latent variable is 50. The Adam optimizer is used for both pre-training and fine-tuning with initial learning rates of 1e-3 and 5e-4, respectively, and an annealing rate of 0.9. We pre-trained the model for 20 epochs.

  • MARS [xie2021mars] leverages Markov chain Monte Carlo (MCMC) sampling on molecules with an annealing scheme and an adaptive proposal. The proposal is parameterized by a graph neural network trained on the MCMC samples. We follow most settings in the original paper: the message-passing network has six layers, the node embedding size is 64, and Adam is the optimizer with an initial learning rate of 3e-4. To generate basic building units, the top-1000 most frequent fragments are drawn from the ZINC database [sterling2015zinc] by enumerating single bonds to break. During the annealing process, the temperature gradually decreases to 0.

  • Autogrow 4.0 [spiegel2020autogrow4] is the base model for RGA and has been briefly described in Section 3.1. Its setup is the same as RGA's for a fair comparison; the only difference is that RGA uses policy networks to guide the selection of ligands and reactions for crossover and mutation, while Autogrow selects them randomly. The reaction set is generated by merging two public reaction libraries: (1) the AutoClickChemRxn set (36 reactions) [durrant2013autogrow] and (2) the RobustRxn set (58 reactions) [hartenfeller2011collection]. In each generation, it generates 200 offspring (100 from crossover and 100 from mutation) and keeps the 50 most promising ones (lowest Vina scores) for the next generation.

  • Screening exhaustively searches the ZINC database [sterling2015zinc] within oracle budget. It is the traditional high-throughput screening approach.

  • REINVENT [Olivecrona] is a reinforcement learning approach that represents molecules as SMILES strings and models them with a recurrent neural network. It pretrains a prior model on ZINC molecules and fine-tunes the model with the reward function, using REINFORCE to maximize the expected reward. The learning rate is set to 0.0005 and the batch size to 64. The hyperparameter σ, which weighs the pretrained prior model against the reward function, is set to 60. The model size is 16.3M.

  • JTVAE [jin2018junction] builds a junction tree that represents a molecule with substructures (rings or atoms) and uses both molecular graph-level and junction tree-level encoders and decoders. The VAE model is pretrained on ZINC, and Bayesian optimization is then used to optimize the docking score in the continuous latent space; we implement the Bayesian optimization with "botorch", the Python Bayesian optimization package. The vocabulary contains 703 substructures extracted from ZINC. The hidden size is 450 and the VAE latent size is 56. The model size is 21.8M.

  • Gen3D [luo20213d] uses a 3D deep generative model that grows molecules by adding atoms auto-regressively. It trains a universal model for all targets. The context encoder has 6 message-passing layers with hidden dimension 256. We train the model with the Adam optimizer at learning rate 0.0001. The model size is 17.4M.

  • GEGL (Genetic Expert-Guided Learning) [ahn2020guiding] uses an LSTM guided by an RL agent to imitate the GA process; however, it cannot inherit the GA's flexible assembling manner due to the auto-regressive nature of the LSTM. It uses Adam with an initial learning rate of 1e-3; the batch size is 512 during sampling and 256 during optimization. In the GA component, the mutation rate is 0.01 and the similarity threshold, which constrains the similarity between the original molecule and the edited molecule, is 0.4. The maximal SMILES length is 120.

Appendix D Additional Experimental Setup

d.1 Docking Simulation

Molecular docking is a computational method that predicts the preferred orientation of one molecule relative to a second when a ligand and a target bind to form a stable complex. Knowledge of the preferred orientation may in turn be used to predict the strength of association, or binding affinity, between the two molecules using, for example, scoring functions. We adopt AutoDock Vina [trott2010autodock] to evaluate binding affinity. The docking score estimated by AutoDock Vina, called the Vina score, roughly characterizes the free energy change of the binding process in kcal/mol. The Vina score is usually negative, and a lower score means a stronger binding affinity between ligand and target. We use the negative value of the docking score as the reward function (Equation 4).

d.2 Dataset

In this paper, we use the ZINC [sterling2015zinc] database and the CrossDocked2020 [francoeur2020three] dataset. ZINC is a free database of about 250 thousand commercially available drug-like compounds for virtual screening [sterling2015zinc]. We randomly select molecules from ZINC as the 0-th generation of the genetic algorithms (RGA, Autogrow 4.0, Graph-GA, GA+D). The other baselines also use ZINC, either to pretrain their models (e.g., JTVAE, REINVENT, RationaleRL) or as the search database (e.g., Screening). We adopt CrossDocked2020, which contains around 22 million ligand–protein complexes, as the training data for pretraining the policy networks, as mentioned in Section 3.3.

Regarding the target proteins, we picked various disease-related proteins, including G-protein coupled receptors (GPCRs) and kinases from DUD-E [mysinger2012directory] and the SARS-CoV-2 main protease [zhang2021potent]. For all selected target proteins, the binding pocket size is set to (15.0, 15.0, 15.0); the unit of the coordinates is the angstrom (10^-10 m). Detailed descriptions of these targets are available at https://www.rcsb.org/.

d.3 Evaluation metrics

We leverage the following evaluation metrics to measure the optimization performance:

  • Novelty is the fraction of the generated molecules that do not appear in the training set.

  • Diversity of generated molecules is defined as the average pairwise Tanimoto distance between the Morgan fingerprints [You2018-xh, jin2020multi, xie2021mars]:

    $$\mathrm{Div}(\mathcal{M}) = \frac{1}{|\mathcal{M}|\,(|\mathcal{M}|-1)} \sum_{m_1, m_2 \in \mathcal{M},\; m_1 \neq m_2} \big(1 - \mathrm{sim}(m_1, m_2)\big), \qquad (5)$$

    where $\mathcal{M}$ is the set of generated molecules that we want to evaluate and $\mathrm{sim}(m_1, m_2)$ is the Tanimoto similarity between molecules $m_1$ and $m_2$, defined as $\mathrm{sim}(m_1, m_2) = \frac{f_1 \cdot f_2}{\|f_1\|^2 + \|f_2\|^2 - f_1 \cdot f_2}$, where $f_i$ is the binary Morgan fingerprint vector of molecule $m_i$; in this paper, it is a 2048-bit binary vector. A minimal computation sketch follows this list.

  • QED represents a quantitative estimate of drug-likeness. QED score ranges from 0 to 1. It can be evaluated by the RDKit package (https://www.rdkit.org/).

  • SA (Synthetic Accessibility) score measures how hard it is to synthesize a given molecule, based on a combination of the molecule's fragment contributions [ertl2009estimation]. It is evaluated via RDKit [landrum2006rdkit]. The raw SA score ranges from 1 to 10; a higher SA score means the molecule is harder to synthesize and is less desirable.

  • Run Time. Unlike optimizing simple oracles such as QED and LogP, docking simulation needs to search the 3D molecular conformation docked into the target, which is computationally expensive. Run time is therefore an important metric for the efficiency of the methods.
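As referenced above, here is a minimal sketch computing Diversity (Equation 5) and QED with RDKit. The SA score requires RDKit's `sascorer` contrib module, which we omit; the fingerprint radius is our choice, with the 2048-bit size matching the definition above, and the SMILES are arbitrary examples.

```python
from itertools import combinations

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED

def diversity(smiles_list):
    """Average pairwise Tanimoto distance between 2048-bit Morgan fingerprints."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048)
           for m in mols]
    dists = [1.0 - DataStructs.TanimotoSimilarity(f1, f2)
             for f1, f2 in combinations(fps, 2)]
    return sum(dists) / len(dists)  # averaging unordered pairs equals Eq. (5)

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1CN", "CCO"]
print("Div:", diversity(smiles))
print("QED:", [round(QED.qed(Chem.MolFromSmiles(s)), 3) for s in smiles])
```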

Appendix E Implementation Details

e.1 Software/Hardware Configuration

We implemented RGA using PyTorch 1.10.2, Python 3.7, and RDKit v2020.09.1.0 on an Intel Xeon E5-2690 machine with 256G RAM and NVIDIA Pascal Titan X GPUs.

e.2 Hyperparameter Setup

The neural architecture of the policy networks is the E(3)-equivariant neural network (ENN) [satorras2021n] over the vocabulary set $\mathcal{V}$ of atom categories. In the ENN, the number of layers is set to 3, i.e., $L = 3$, and the hidden dimension is set to 100, i.e., $d = 100$. In Equation (1), $\mathrm{MLP}_e, \mathrm{MLP}_x, \mathrm{MLP}_h$ are two-layer MLPs with Swish activation [ramachandran2017searching] in the hidden layer. A summation readout aggregates the last layer's node embeddings into the graph-level embedding. All atoms within the binding site are used as the input of the ENN. REINFORCE is used to implement the policy gradient [peters2010policy, Olivecrona], and Adam is utilized as the optimizer for both the crossover and mutation policy networks. The reaction set is generated by merging two public reaction libraries: (1) the AutoClickChemRxn set (36 reactions) [durrant2013autogrow] and (2) the RobustRxn set (58 reactions) [hartenfeller2011collection]; we use RDKit [landrum2006rdkit] to perform in silico chemical reactions based on the SMARTS reaction templates, giving $|\mathcal{R}| = 36 + 58 = 94$. In each generation, we generate up to 200 offspring (100 from crossover and 100 from mutation) and keep the 50 most promising ones (lowest Vina scores) for the next generation, i.e., $N = 50$.

e.3 Code Repository

The code repository is uploaded in supplementary material for reproducibility.

Appendix F Additional Experiment

f.1 Efficiency Study

As mentioned before, unlike optimizing simple oracles such as QED and LogP, docking simulation needs to search the 3D molecular conformation docked into the target, which is time-consuming; run time is thus an important measure of the efficiency of the methods. We report run-time bars over the different targets for all compared methods in Figure 4. Run times vary greatly across methods, so for ease of visualization we divide the methods into two groups. The slow group contains 5 methods that take more than 10 hours: Screening, GEGL, REINVENT, RationaleRL and JTVAE. The fast group (under 10 hours) contains 7 methods: MARS, MolDQN, Gen3D, GA+D, Graph-GA, Autogrow, and RGA. We find that both Autogrow and RGA are efficient compared with the other methods, which we attribute to the design of the genetic algorithm (both crossover and mutation operations) and the use of filters after the GA operators, as described in Section 3.1. RGA is only slightly slower than Autogrow because it requires additional computation to pretrain and train the policy networks.

(a) Slow group (more than 10 hours)
(b) Fast group (less than 10 hours)
Figure 4: Efficiency evaluation measured by run time (in hours) for all methods. Due to the large variance in run time, we divide the methods into two groups for ease of visualization: the slow group contains 5 methods (Screening, GEGL, REINVENT, RationaleRL and JTVAE), and the fast group contains 7 methods (MARS, MolDQN, Gen3D, GA+D, Graph-GA, Autogrow, and RGA).

f.2 Pretraining Equivariant neural network

Pretraining the equivariant neural network is a crucial step in RGA. We adopt the CrossDocked2020 [francoeur2020three] dataset, which contains around 22 million ligand–protein complexes, as the training data for pretraining the policy networks, as mentioned in Section 3.3. We split the whole dataset into training and validation sets with a ratio of 9:1. Each data point is a target–ligand complex and its binding affinity (a scalar). We report the validation loss of the ENN on the target–ligand binding affinity prediction task, using root mean square error (RMSE) as the loss function. We find the learning process converges rapidly, within 150K data points (less than one epoch), in terms of validation RMSE.

Figure 5: Learning curve of pretraining the equivariant neural network on target–ligand binding affinity prediction. We plot the root mean square error (RMSE) loss as a function of the number of data points seen. We use early stopping to terminate learning when the validation loss stops decreasing, saving computational resources and avoiding overfitting. The learning process converges within 150K data samples (less than one epoch), with the RMSE loss decreasing from more than 8 to less than 1.

f.3 Additional Ablation Study

To further understand our model and the GA process, we conduct an ablation study investigating the impact of each component and strategy on optimization performance. Specifically, we consider the following four variants of RGA. RGA-pretrain does not pretrain the policy networks. RGA-KT (knowledge transfer) does not train the policy networks across different target proteins, i.e., it optimizes ligands for one target at a time. RGA-MU (mutation) does not use the mutation operation, so all ligands are generated via crossover. Correspondingly, RGA-CO (crossover) does not use the crossover operation, so all ligands are generated via mutation. The results are reported in Table 3. We find that removing any component or strategy causes a drop in optimization performance (i.e., an increase in the TOP-100/10/1 scores). Both crossover and mutation are critical to the optimization performance, and both pretraining the policy networks and knowledge transfer between targets contribute positively. This ablation study further validates the effectiveness of the proposed RGA method.

Method TOP-100 (↓) TOP-10 (↓) TOP-1 (↓) Nov (↑) Div (↑) QED (↑) SA (↓)
RGA (full) -11.867±0.170 -12.564±0.287 -12.869±0.473 100.0±0.0% 0.857±0.020 0.742±0.036 2.473±0.048
RGA - pretrain -11.443±0.219 -12.424±0.386 -12.435±0.654 100.0±0.0% 0.854±0.035 0.750±0.034 2.494±0.043
RGA - KT -11.434±0.169 -12.437±0.354 -12.502±0.603 100.0±0.0% 0.853±0.028 0.738±0.034 2.501±0.050
RGA - MU -10.919±0.166 -11.135±0.362 -11.747±0.455 100.0±0.0% 0.812±0.032 0.702±0.050 2.970±0.048
RGA - CO -9.866±0.169 -10.320±0.296 -10.793±0.501 100.0±0.0% 0.737±0.048 0.748±0.067 2.467±0.034
Table 3: Ablation studies. Arrows (↓, ↑) indicate the direction of better performance. For each metric, the best method is underlined. RGA-pretrain is a variant of RGA that does not pretrain the policy neural network. RGA-KT (knowledge transfer) is a variant of RGA that does not train the policy neural network across different target proteins, i.e., it optimizes ligands for one target at a time. RGA-MU (mutation) is a variant of RGA that does not use the mutation operation in GA; that is, all ligands are generated via the crossover operator. Correspondingly, RGA-CO (crossover) is a variant that does not use the crossover operation, so all ligands are generated via the mutation operator. Comparing the results with RGA (full) in the first row, we observe that removing any component causes a drop in optimization performance (i.e., an increase in TOP-100/10/1 docking scores).

F.4 Scalability

As mentioned before, we denote the population size in RGA by N. We now analyze the computational complexity within each generation, which involves two operations, crossover and mutation, as described in Section 3.3.

For crossover, the first step is to select the first parent molecule: we evaluate the selection probability over all molecules in the current population, whose complexity is O(N). The second step is to select the second parent molecule conditioned on the first: we evaluate the probability over all remaining molecules in the current population, as shown in Equation (2), whose complexity is also O(N). The overall complexity of the crossover operation is therefore O(N).

On the other hand, for the mutation operation, the first step is to select the parent molecule (only one parent) by evaluating the probability over all molecules in the current population, whose complexity is O(N). The second step is to select a mutated molecule (generated by chemical reaction), as shown in Equation (3); its complexity is O(|R|), where R is the reaction set (see Table 2). The overall complexity of the mutation operation is therefore O(N + |R|).

To summarize, the per-generation complexity of RGA is O(N + |R|). That is, RGA scales linearly in the population size N and the size of the chemical reaction set |R|. As mentioned in Section E.2, both N and |R| are moderate in practice. Thus, RGA has the desired scalability and is not computationally expensive.
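To make the per-generation cost concrete, the sketch below illustrates the O(N) crossover selection and the O(N + |R|) mutation selection; `policy_score` is a hypothetical stub standing in for the pretrained policy networks, and the softmax sampling mirrors Equations (2) and (3) only schematically.

```python
import math
import random

def softmax_sample(items, scores):
    """Sample one item with probability proportional to exp(score)."""
    m = max(scores)                              # numerical stability
    weights = [math.exp(s - m) for s in scores]
    return random.choices(items, weights=weights, k=1)[0]

def policy_score(mol, context=None):
    # Stub standing in for the pretrained ENN policy, optionally
    # conditioned on a first parent or a candidate reaction.
    return random.random()

def crossover_select(population):
    # First parent: evaluate all N molecules -> O(N).
    p1 = softmax_sample(population, [policy_score(m) for m in population])
    rest = [m for m in population if m != p1]
    # Second parent: evaluate the remaining N-1 molecules -> O(N).
    p2 = softmax_sample(rest, [policy_score(m, context=p1) for m in rest])
    return p1, p2                                # O(N) overall

def mutation_select(population, reactions):
    # Parent: evaluate all N molecules -> O(N).
    parent = softmax_sample(population, [policy_score(m) for m in population])
    # Reaction: evaluate all |R| reactions -> O(|R|).
    rxn = softmax_sample(reactions, [policy_score(parent, context=r) for r in reactions])
    return parent, rxn                           # O(N + |R|) overall

population = [f"mol_{i}" for i in range(20)]     # toy population, N = 20
reactions = [f"rxn_{j}" for j in range(50)]      # toy reaction set, |R| = 50
print(crossover_select(population))
print(mutation_select(population, reactions))
```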

F.5 Significance Studies

In this section, we present significance studies. Specifically, we conduct hypothesis testing to show the statistical significance of our method's improvement over the other GA-based methods (AutoGrow4, Graph-GA, GA+D, and GEGL), running 5 independent trials with different random seeds and evaluating the p-values. We consider the top-100, top-10, and top-1 scores, the most important metrics for optimization performance. The results (p-values) are shown in Table 4. We find that almost all p-values are less than 0.05 (with one exception), which indicates that the improvements of RGA over the other GA methods are statistically significant.

Comparison p-value on TOP-100 p-value on TOP-10 p-value on TOP-1
RGA vs. AutoGrow4 0.002 0.005 0.07
RGA vs. Graph-GA 0.003 0.010 0.046
RGA vs. GA+D 1.0e-7 5.0e-5 3.0e-4
RGA vs. GEGL 2.5e-4 3.7e-3 5.0e-3
Table 4: Results of hypothesis testing. We conduct hypothesis testing to show the statistical significance of our method over other GA-based methods. Specifically, we compare the significance of the improvement over GA methods (AutoGrow4, Graph-GA, GA+D, GEGL) by running 5 independent trials with different random seeds and evaluating the p-values. We consider the top-100/10/1 scores, the most important metrics for optimization performance.
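As an illustration, p-values over 5 trials could be computed as below. The exact test statistic is not specified in this section, so this sketch assumes a two-sample t-test from SciPy; the trial scores are placeholders, not the paper's measurements.

```python
from scipy import stats

rga_top100      = [-11.9, -11.7, -12.0, -11.8, -11.9]  # placeholder trials
autogrow_top100 = [-11.4, -11.2, -11.5, -11.3, -11.4]  # placeholder trials

# One-sided test: is RGA's (lower-is-better) docking score significantly
# smaller than the baseline's?
t_stat, p_two_sided = stats.ttest_ind(rga_top100, autogrow_top100)
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.4f}")
```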
Figure 6: Examples of ligand poses (generated by RGA) in the binding site of target "2rgp"; panels (a)-(h) show docking scores from -12.8 to -12.6 kcal/mol.
Figure 7: Examples of ligand poses (generated by RGA) in the binding site of target "3ny8"; panels (a)-(i) show docking scores from -13.4 to -13.3 kcal/mol.
Figure 8: Examples of ligand poses (generated by RGA) in the binding site of target "1iep"; panels (a)-(f) show docking scores from -13.1 to -12.8 kcal/mol.
Figure 9: Examples of ligand poses (generated by RGA) in the binding site of target "4unn"; panels (a)-(i) show docking scores from -12.4 to -12.1 kcal/mol.

F.6 Examples of the Generated Ligands

This section shows examples of generated ligands with desirable binding affinities for various target proteins in Figures 6, 7, 8, and 9, respectively. We observe that the generated ligands bind tightly to the target proteins.

Appendix G Additional Discussion on Related Work

Methodology. Molecule generation methods can be divided into two categories. The first is deep generative models (DGMs), which leverage continuous representations to estimate the data distribution using various kinds of deep neural networks, including variational autoencoders (VAE) [gomez2018automatic, jin2018junction], generative adversarial networks (GAN) [guimaraes2017objective, cao2018molgan], normalizing flow models [shi2019graphaf, zang2020moflow, luo2021graphdf], energy-based models [liu2021graphebm], and diffusion models [xu2021geodiff]. The second is combinatorial optimization methods, which directly search the discrete chemical space, including genetic algorithms (GA) [jensen2019graph, nigam2019augmenting, gao2021amortized], reinforcement learning (RL) approaches [Olivecrona, You2018-xh, zhou2019optimization, jin2020multi], Bayesian optimization (BO) [korovina2020chembo, moss2020boss], Monte Carlo tree search (MCTS) [yang2017chemts, yang2020practical, jin2020multi], and Markov chain Monte Carlo (MCMC) [fu2021mimosa, xie2021mars, bengio2021gflownet]. DGMs usually require a large amount of data to fit, which impedes their usage in low-data regimes. In contrast, combinatorial optimization methods require less training data; the trade-off is the need to call the optimization oracles during exploration of the chemical space [zhou2019optimization, fu2021differentiable, gao2020synthesizability, gao2021amortized, brown2019guacamol].

Among machine learning methods, genetic algorithms (GA) exhibit superior performance on some standard benchmarks [brown2019guacamol, huang2021therapeutics, gao2022sample]. The key reason is GA's global assembling strategy. Specifically, in each generation (iteration), GA maintains a population of molecule candidates (a.k.a. parents) and conducts crossover between two (randomly selected) parent candidates, which enables relatively large exchanges of sub-graphs between molecular graphs. However, GA relies on random-walk-like mutation and crossover operations [jensen2019graph, spiegel2020autogrow4] and is essentially a brute-force trial-and-error procedure.

On the other hand, reinforcement learning (RL) methods are good at navigating discrete spaces by prioritizing promising search branches and circumventing brute-force search. However, current RL-based methods [Olivecrona, You2018-xh, zhou2019optimization, jin2020multi] lag slightly behind other state-of-the-art combinatorial optimization methods [fu2021differentiable, huang2021therapeutics]. The main reason is that current RL-based molecule optimization approaches use an auto-regressive assembling strategy, i.e., growing molecules iteratively by adding one basic building block at a time, where the building block can be either a token in the SMILES representation [Olivecrona] or a substructure in the molecular graph representation [You2018-xh, zhou2019optimization, jin2020multi]. Such assembling strategies are essentially local search methods, which hinders the algorithm's ability to overcome rough optimization landscapes (or energy barriers) and makes it easy to get stuck in local optima [conti2018improving, liu2021decoupling].

Discussion. Among machine learning methods, molecular-graph-level genetic algorithms (GA) exhibit state-of-the-art performance on some standard molecule optimization benchmarks [brown2019guacamol, huang2021therapeutics, gao2022sample]. The key reason is GA's assembling manner. Specifically, in each generation (iteration), GA maintains a population of candidates (a.k.a. parents) and conducts crossover between two candidates to generate new offspring, which enables thorough exploration of the chemical space. However, there is still room for improvement: GA relies on random-walk-like mutation and crossover operations [jensen2019graph] and suffers from a brute-force trial-and-error strategy.

On the other hand, reinforcement learning approaches are good at navigating discrete spaces by prioritizing the promising decisions worth investigating; for example, AlphaGo successfully applied RL to defeat a professional human Go player [silver2017mastering]. However, current RL-based drug design methods [Olivecrona, You2018-xh, zhou2019optimization] lag slightly behind other state-of-the-art combinatorial optimization methods. The main reason lies in their assembling strategy, which grows molecules in an auto-regressive fashion. It is hard for this kind of local search to overcome barriers in the objective landscape, so it easily gets trapped in local optima.

Deep learning methods can also enhance genetic algorithms. Due to the random selection used in genetic algorithms, it is challenging to apply deep learning to the generation of new candidates (molecules in this paper). Instead, deep learning can be used in the fitness evaluation that selects offspring. For example, GA+D [nigam2019augmenting] leverages a deep neural network as a discriminator to measure a molecule's proximity to the training data, which is incorporated as a scorer in the fitness evaluation. [kwon2021evolutionary] train a deep neural network-based property predictor and leverage it to enhance the evolutionary algorithm.

In this paper, we attempt to enhance the genetic algorithm with reinforcement learning. Specifically, we propose Reinforced Genetic Algorithm (RGA), which inherits the assembling manner of the genetic algorithm and uses reinforcement learning to guide the search over the chemical space. [ahn2020guiding] also combine RL and GA, using an LSTM (guided by an RL agent) to imitate the GA process; however, it cannot inherit GA's flexible assembling manner due to the auto-regressive nature of the LSTM.