Learning retrosynthetic planning through self-play

01/19/2019 ∙ by John S. Schreck, et al. ∙ Columbia University 0

The problem of retrosynthetic planning can be framed as one player game, in which the chemist (or a computer program) works backwards from a molecular target to simpler starting materials though a series of choices regarding which reactions to perform. This game is challenging as the combinatorial space of possible choices is astronomical, and the value of each choice remains uncertain until the synthesis plan is completed and its cost evaluated. Here, we address this problem using deep reinforcement learning to identify policies that make (near) optimal reaction choices during each step of retrosynthetic planning. Using simulated experience or self-play, we train neural networks to estimate the expected synthesis cost or value of any given molecule based on a representation of its molecular structure. We show that learned policies based on this value network outperform heuristic approaches in synthesizing unfamiliar molecules from available starting materials using the fewest number of reactions. We discuss how the learned policies described here can be incorporated into existing synthesis planning tools and how they can be adapted to changes in the synthesis cost objective or material availability.



There are no comments yet.


This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The primary goal of computer-aided synthesis planning (CASP) is to help chemists accelerate the synthesis of desired molecules.1, 2, 3 Generally, a CASP program takes as input the structure of a target molecule and returns a sequence of feasible reactions linking the target to commercially available starting materials. The number of possible synthesis plans is often astronomical, and it is therefore desirable to identify the plan(s) that minimize some user-specified objective function (e.g., synthesis cost ). The challenge of identifying these optimal syntheses can be framed as a one-player game—the retrosynthesis game—to allow for useful analogies with chess and Go, for which powerful solutions based on deep reinforcement learning now exist 4, 5. During play, the chemist starts from the target molecule and identifies a set of candidate reactions by which to make the target in one step (Figure 1). At this point, the chemist must decide which reaction to choose. As in other games such as chess, the benefits of a particular decision may not be immediately obvious. Only when the game is won or lost can one fairly assess the value of decisions that contributed to the outcome. Once a reaction is selected, its reactant(s) become the new target(s) of successive retrosynthetic analyses. This branching recursive process of identifying candidate reactions and deciding which to use continues until the growing synthesis tree reaches the available substrates (a “win”) or it exceeds a specified number of synthetic steps (a “loss”).

Winning outcomes are further distinguished by the cost of the synthesis pathway identified—the lower the better. This synthesis cost is often ambiguous and difficult to evaluate as it involves a variety of uncertain or unknown quantities. For example, the synthesis cost might include the price of the starting materials, the number of synthetic steps, the yield of each step, the ease of product separation and purification, the amount of chemical waste generated, the safety or environmental hazards associated with the reactions and reagents, et cetera. It is arguably more challenging to accurately evaluate the cost of a proposed synthesis than it is to generate candidate syntheses. It is therefore common to adopt simple objective functions that make use of the information available (e.g., the number of reactions but not their respective yields).6 We refer to the output of any such function as the cost of the synthesis; optimal synthesis plans correspond to those with minimal cost.

Figure 1: The objective of the retrosynthesis game is to synthesize the target product from available substrates by way of a synthesis tree that minimizes the cost function. Molecules and reactions are illustrated by circles and squares, respectively. Starting from the target, a reaction (yellow) is selected according to a policy that links with precursors . The grey squares leading to illustrate the other potential reactions in . The game continues one move at time reducing intermediate molecules (blue) until there are only substrates remaining, or until a maximum depth of 10 is reached. Rare molecules (green), which are not makeable under these constraints nor commercially available, are assigned a cost penalty of 100, while molecules at maximum depth (purple) are assigned a cost penalty of 10. Commercially available substrates are assigned zero cost. The synthesis cost of the product may be computed according to Eq. 1 only on completion of the game. Here, the sampled pathway leading to the target (red arrows) has a cost of 5.


An expert chemist excels at the retrosynthesis game for two reasons: (1) she can identify a large number of feasible reaction candidates at each step, and (2) she can select those candidates most likely to lead to winning syntheses. These abilities derive from the chemist’s prior knowledge and her past experience in making molecules. In contrast to games with a fixed rule set like chess, the identification of feasible reaction candidates (i.e., the possible “moves”) is nontrivial: there may be thousands of possible candidates at each step using known chemistries. To address this challenge, computational approaches have been developed to suggest candidate reactions using libraries of reaction templates prepared by expert chemists7 or derived from literature data.8, 9, 10 Armed with these “rules” of synthetic chemistry, a computer can, in principle, search the entire space of possible synthesis pathways and identify the optimal one.

In practice, however, an exhaustive search of possible synthesis trees is not computationally feasible or desirable owing to the exponential growth in the number of reactions with distance from the target.11, 6 Instead, search algorithms generate a subset of possible synthesis trees, which may or may not contain the optimal pathway(s). For longer syntheses, the subset of pathways identified is an increasingly small fraction of the total available. Thus, it is essential to bias retrosynthetic search algorithms towards those regions of synthesis space most likely to contain the optimal pathway. In the game of retrosynthesis, the player requires a strong guiding model, or policy, for selecting the reaction at each step that leads to the optimal synthetic pathway(s).

Prior reports on retrosynthetic planning have explored a variety of policies for guiding the generation of candidate syntheses.12, 13, 7, 14 These programs select among possible reactions using heuristic scoring functions,15, 7 crowd-sourced accessibility scores,16, 13 analogy to precedent reactions,17

or parametric models (e.g., neural networks) trained on literature precedents.

18, 19 In particular, the Syntaurus software7 allows for user-specified scoring functions that can describe common strategies used by expert chemists20 (e.g., using symmetric disconnections to favor convergent syntheses). By contrast, Segler and Waller used literature reaction data to train a neural network that determines which reaction templates are most likely to be effective on a given molecule.18 The ability to rank order candidate reactions (by any means) allows for guiding network search algorithms (e.g., Monte Carlo tree search19) to generate large numbers of possible synthesis plans. The costs of these candidates can then be evaluated to identify the “best” syntheses, which are provided to the chemist (or perhaps a robotic synthesizer).

Here, we describe a different approach to retrosynthetic planning based on reinforcement learning,21 in which the computer learns to select those candidate reactions that lead ultimately to synthesis plans minimizing a user-specified cost function. Our approach is inspired by the recent success of deep reinforcement learning in mastering combinatorial games such as Go using experience generated by repeated self-play 4, 5. In this way, DeepMind’s AlphaGo Zero learned to estimate the value of any possible move from any state in the game, thereby capturing the title of world champion.22, 4 Similarly, by repeated plays of the retrosynthesis game, the computer can learn which candidate reactions are most likely to lead from a given molecule to available starting materials in an optimal fashion. This approach requires no prior knowledge of synthetic strategy beyond the “rules” governing single-step reactions encoded in a library of reaction templates. Starting from a random policy, the computer explores the synthetic space to generate estimates of the synthesis cost for any molecule. These estimates form the basis for improved policies that guide the discovery of synthesis plans with lower cost. This iterative process of policy improvement converges in time to optimal policies that identify the “best” pathway in a single play of the retrosynthesis game. Importantly, we show that (near) optimal policies trained on the synthesis of 100,000 diverse molecules generalize well to the synthesis of unfamiliar molecules. We discuss how the learned policies described here can be incorporated into existing synthesis planning tools and how they can be adapted to different cost functions that reflect the changing demands of organic synthesis.

2 Results and Discussion

2.1 The Retrosynthesis Game

We formulate the problem of retrosynthetic analysis as a game played by a synthetic chemist or a computer program. At the start of the game, the player is given a target molecule to synthesize starting from a set of buyable molecules denoted . For any such molecule, there exists a set of reactions, denoted , where each reaction includes the molecule as a product. From this set, the player chooses a particular reaction according to a policy

, which defines the probability of selecting that reaction for use in the synthesis. The cost

of performing the chosen reaction is added to a running total, which ultimately determines the overall synthesis cost. Having completed one step of the retrosynthesis game, the player considers the reactant(s) of the chosen reaction in turn. If a reactant is included among the buyable substrates, , then the cost of that molecule is added to the running total. Otherwise, the reactant must be synthesized following the same procedure outlined above. This recursive processes results in a synthesis tree whose root is the target molecule and whose leaves are buyable substrates. The total cost of the resulting synthesis is


where the respective sums are evaluated over all reactions and all leaf molecules included in the final synthesis tree. This simple cost function neglects effects due to interactions between successive reactions (e.g., costs incurred in switching solvents); however, it has the useful property that the expected cost of making any molecule in one step via reaction is directly related to the expected cost of the associated reactants ,


This recursive function terminates at buyable molecules of known cost, for which independent of the policy.

The function denotes the expected cost or “value” of any molecule under a specified policy . By repeating the game many times starting from many target molecules, it is possible to learn a parametric representation of this function valid for most molecules. Importantly, knowledge of the value function under one policy enables the creation of new and better policies that reduce the expected cost of synthesizing a molecule. Using methods of reinforcement learning, such iterative improvement schemes lead to the identification of optimal policies , which identify synthesis trees of minimal cost. The value of a molecule under such a policy is equal to the expected cost of selecting the “best” reaction at each step such that


From a practical perspective, the optimal value function takes as input a molecule (e.g., a representation of its molecular structure) and outputs a numeric value corresponding to the minimal cost with which it can be synthesized.

Here, we considered a set of 100,000 target molecules selected from the Reaxys database on the basis of their structural diversity (see Methods). The set of buyable substrates contained 300,000 molecules selected from the Sigma-Aldrich,23 eMolecules,24 and LabNetwork25 catalogs that have list prices less than $100/g. At each step, the possible reactions were identified using a set of 60,000 reaction templates derived from more than 12 million single-step reaction examples reported in the Reaxys database (see Methods). Because the application of reaction templates is computationally expensive, we used a template prioritizer to identify those templates most relevant to a given molecule.18 On average, this procedure resulted in up to 50 possible reactions for each molecule encountered during synthesis planning. We assume the space of molecules and reactions implicit in these transformations is representative of real organic chemistry while recognizing the inevitable limitations of templates culled from incomplete and sometimes inaccurate reaction databases. For simplicity, the cost of each reaction step was set to one, , and the substrate costs to zero, . With these assignments, the cost of making a molecule is equivalent to the number of reactions in the final synthesis tree.

To prohibit the formation of unreasonably deep synthesis trees, we limited our retrosynthetic searches to a maximum depth of . As detailed in the Methods, the addition of this termination criterion to the recursive definition of the value function (2) requires some minor modifications to the retrosynthesis game. In particular, the expected cost of synthesizing a molecule depends also on the residual depth, , where is the difference between the maximum depth and the current depth within the tree. If a molecule not included among the buyable substrates is encountered at a residual depth of zero, it is assigned a large cost , thereby penalizing the failed search. Additionally, in the event that no reactions are identified for a given molecule (), we assign an even larger penalty , which encourages the player to avoid such molecules if possible. Below, we use the specific numeric penalties of and for all games.

2.2 Heuristic Policies

En route to the development of optimal policies for retrosynthesis, we first consider the performance of some heuristic policies that provide context for the results below. Arguably the simplest policy is one of complete ignorance, in which the player selects a reaction at random at each stage of the synthesis—that is, . We use this “straw man” policy to describe the general process of policy evaluation and provide a baseline from which to measure subsequent improvements.

During the evaluation process, the computer plays the retrosynthesis game to the end making random moves at each step of the way. After each game, the cost of each molecule in the resulting synthesis tree is computed. This process is repeated for each of the 100,000 target molecules considered. These data points—each containing a molecule at residual depth with cost —are used to update the parametric approximation of the value function . As detailed in the Methods, the value function is approximated by a neural network that takes as input an extended-connectivity (ECFP) fingerprint of the molecule and the residual depth and outputs a real valued estimate of the expected cost under the policy .26 This process is repeated in an iterative manner as the value estimates of the target molecules, , approach their asymptotic values.

Figure 2a shows the total synthesis cost for a single target molecule under the random policy (markers). Each play of the retrosynthesis game has one of three possible outcomes: a “winning” synthesis plan terminating in buyable substrates (blue circles), a “losing” plan that exceeds the maximum depth (green triangles), and a “losing” plan that contains dead-end molecules that cannot be bought or made (black pentagons). After many synthesis attempts, the running average of the fluctuating synthesis cost converges to the expected cost as approximated by the neural network (red line). Repeating this analysis for the 100,000 target molecules, the random policy results in an average cost of 110 per molecule with only a 25% chance of identifying a winning synthesis in each attempt. Clearly, there is room for improvement.

Figure 2: Heuristic policies. (a) Synthesis cost for a single molecule (N-dibutyl-4-acetylbenzeneacetamide) for successive iterations of the retrosynthesis game under the random policy. Blue circles denote “winning” synthesis plans that trace back to buyable molecules. Green triangles and black pentagons denote “losing” plans that exceed the maximum depth or include unmakable molecules, respectively. The solid line shows the neural network prediction of the value function as it converges to the average synthesis cost. The dashed line shows the expected cost under the deterministic “symmetric disconnection” policy with . (b) Distribution of expected costs over the set of 100,000 target molecules for different noise levels . The red squares and black circles show the performance of the symmetric disconnection policy () and the random policy (), respectively. (c) The average synthesis cost of the target molecules decreases with increasing noise level , while the average branching factor increases. Averages were estimated from 50 plays for each target molecule.

Beyond the random policy, even simple heuristics can be used to improve performance significantly. In one such policy, inspired by Syntaurus,7 the player selects the reaction that maximizes the quantity


where is the length of the canonical smiles string representing molecule , is a user-specified exponent, and the sum is taken over all reactants associated with a reaction . When , the reactions that maximize this function can be interpreted as those that decompose the product into multiple parts of roughly equal size. Note that in contrast to the random policy, this greedy heuristic is deterministic: each play of the game results in the same outcome. Figure 2a shows the performance of this “symmetric disconnection” policy with for a single target molecule (dashed line). Interestingly, while the pathway identified by the greedy policy is much better on average than those of the random policy ( vs. ), repeated application of the latter reveals the existence of an even better pathway containing only three reactions. An optimal policy would allow for the identification of that best synthesis plan during a single play of the retrosynthesis game.

The performance of a policy is characterized by the distribution of expected costs over the set of target molecules. Figure 2

b shows the cost distribution for a series of policies that interpolate between the greedy “symmetric disconnection” policy and the random policy (see also Figure 

6). The intermediate -greedy policies behave greedily with probability , selecting the reaction that maximizes , but behave randomly with probability , selecting any one of the possible reactions with equal probability. On average, the addition of such noise is usually detrimental to policy performance. Noisy policies are less likely to identify a successful synthesis for a given target (Figure 7a) and result in longer syntheses when they do succeed (Figure 7b). Consequently, the average cost increases monotonically with increasing noise as quantified by the parameter (Figure 2c).

The superior performance (lower synthesis costs) of the greedy policy is correlated with the average branching factor , which represents the average number of reactants for each reaction in the synthesis tree. Branching is largest for the greedy policy () and decreases monotonically with increasing (Figure 2c). On average, synthesis plans with greater branching (i.e., convergent syntheses) require fewer synthetic steps to connect the target molecules to the set of buyable substrates. This observation supports the chemical intuition underlying the symmetric disconnection policy: break apart each “complex” molecule into “simpler” precursors. However, this greedy heuristic can sometimes be short-sighted. An optimal retrosynthetic “move” may increase molecular complexity in the short run to reach simpler precursors more quickly in the longer run (e.g., in protecting group chemistry). An optimal policy would enable the player to identify local moves (i.e., reactions) that lead to synthesis pathways with minimum total cost.

2.3 Policy improvement through self-play

Knowledge of the value function, , under a given policy enables the identification of better policies that reduce the expected synthesis cost. To see this, consider a new policy that selects at each step the reaction that minimizes the expected cost under the old policy


By the policy improvement theorem21, this greedy policy is guaranteed to be as good or better than the old policy —that is, , where equality holds only for the optimal policy. This result provides a basis for systematically improving any policy in an iterative procedure called policy iteration21, in which the value function leads to an improved policy that leads to a new value function and so on.

One of the challenges in using the greedy policy Eq. 5 is that it generates only a single pathway and its associated cost for each of the target molecules. The limited exposure of these greedy searches can result in poor estimates of the new value function , in particular for molecules that are not included in the identified pathways. A better estimate of can be achieved by exploring more of the molecule space in the neighborhood of these greedy pathways. Here, we encourage exploration by using an -greedy policy, which introduces random choices with probability but otherwise follows the greedy policy Eq. 5. Iteration of this -soft policy is guaranteed to converge to an optimal policy that minimizes the expected synthesis cost for a given noise level .21 Moreover, by gradually lowering the noise level, it is possible to approach the optimal greedy policy in the limit as .

2.4 Training protocol

Starting from the random policy, we used self-play to learn an improved policy over the course of 1000 iterations, each comprised of 100,000 retrosynthesis games initiated from the target molecules. During the first iteration, each target molecule was considered in turn using the -greedy policy Eq. 5 with

. Candidate reactions and their associated reactants were identified by application of reaction templates as detailed in the Methods. Absent an initial model of the value function, the expected costs of molecules encountered during play were selected at random from a uniform distribution on the interval

. Following the completion of each game, the costs of molecules in the selected pathway were computed and stored for later use. In subsequent iterations, the values of molecules encountered previously (at a particular depth) were estimated by their average cost. After the first 50 iterations, the value estimates accumulated during play were used to train a neural network, which allowed for estimating the values of new molecules not encountered during the previous games (see Methods for details on the network architecture and training). Policy improvement continued in an iterative fashion guided both by the average costs (for molecules previously encountered) and by the neural network (for new molecules), which was updated every 50-100 iterations.

During policy iteration, the noise parameter was reduced from to in increments of 0.05 every 200 iterations in an effort to anneal the system towards an optimal policy. Following each change in , the saved costs were discarded such that subsequent value estimates were generated at the current noise level . The result of this training procedure was a neural network approximation of the (near) optimal value function , which estimates the minimum cost of synthesizing any molecule starting from residual depth . In practice, we found that a slightly better value function could be obtained using the cumulative reaction network generated during policy iteration. Following Kowalik et al.,6 we used dynamic programming to compute the minimum synthesis cost for each molecule in the reaction network. These minimum costs were then used to train the final neural network approximation of the value function .

Figure 3: Training results. In (a) and (b) and computed using are plotted versus policy iterations, respectively (solid blue squares). Solid horizontal lines show these quantities for the heuristic policy (red triangles) and the random policy (black circles). The larger cyan square shows after each tree had been searched for the best (lowest) target cost. Dashed vertical lines show points when was lowered.

2.5 Training results

Figure 3a shows how the average synthesis cost decreased with each iteration over the course of the training process. Initially, the average cost was similar to that of the random policy () but improved steadily as the computer learned to identify “winning” reactions that lead quickly to buyable substrates. After 800 iterations, the cost dropped below that of the symmetric disconnection policy () but showed little further improvement in the absence of exploration (i.e., with ). The final cost estimate (, cyan square) was generated by identifying the minimum cost pathways present in the cumulative reaction network generated during the training process. The final drop in cost for suggests that further policy improvements are possible using improved annealing schedules. We emphasize that the final near-optimal policy was trained from a state of complete ignorance, as directed by the user-specified objective function to minimize the synthesis cost.

During the training process, the decrease in synthesis cost was guided both by motivation, as prescribed by the cost function, and by opportunity, as dictated by the availability of alternate pathways. Early improvements in the average cost were achieved by avoiding “unmakable” molecules, which contributed the largest cost penalty, . Of the target molecules, 11% reduced their synthesis cost from to by avoiding such problematic molecules. By contrast, only 2% of targets improved their cost from to . In other words, if a synthesis tree was not found initially at a maximum depth of , it was unlikely to be discovered during the course of training. Perhaps more interesting are those molecules (ca.  10%) for which syntheses were more easily found but subsequently improved (i.e., shortened) during the course of the training process. See Table 1 for a more detailed breakdown of these different groups.

Train (100,00) Test (25,000)
19.3 13.1 19.2 11.5
1.54 1.65 1.54 1.58
























Table 1: Training and testing results for the symmetric disconnection policy and the learned policy . Percentages were computed based the sizes of the training set (100,000) and the testing set (25,000).

Consistent with our observations above, lower cost pathways were again correlated with the degree of branching along the synthesis trees (Figure 3b). Interestingly, the average branching factor for synthesis plans identified by the learned policy was significantly larger than that of the symmetric disconnection policy ( vs. ). While the latter favors branching, it does so locally based on limited information—namely, the heuristic score of Eq. 5. By contrast, the learned policy uses information provided in the molecular fingerprint to select reactions that increase branching across the entire synthesis tree (not just the single step). Furthermore, while the heuristic policy favors branching a priori, the learned policy does so only in the service of reducing the total cost. Changes in the objective function (e.g., in the cost and availability of the buyable substrates) will lead to different learned policies.

2.6 Model validation

Figure 4: Model Validation. In (a) and (b) a 2D histogram illustrates the relationship between the synthesis cost determined by the learned policy and that predicted by the value network for each of the 100,000 training molecules and the 25,000 testing molecules, respectively. In (c) and (d), a 2D histogram compares the synthesis cost determined by the symmetric disconnection policy to that of learned policy for training and testing molecules, respectively. In (a-d), the gray-scale intensity is linearly proportional to the number of molecules within a given bin; the red line shows the identity relation. In (c) and (d), the percentage of molecules for which (

) found the cheaper pathway is listed below (above) the red line. In (e) and (f), distributions of synthesis costs

determined under policies and are shown for both testing and testing molecules, respectively.

Figure 4 compares the performance of the learned policy evaluated on the entire set of 100,000 target molecules used for training and on a different set of 25,000 target molecules set aside for testing. For the training molecules, the value estimates predicted by the neural network are highly correlated with the actual costs obtained by the final learned policy (Figure 4a). We used the same near-optimal policy to determine the synthesis cost of the testing molecules, . As illustrated in Figure 4b, these costs were correlated to the predictions of the value network albeit more weakly than those of the training data (Pearson coefficient of 0.5 for testing vs. 0.99 for training). This correlation was stronger for the data in Figure 4b, which focuses on those molecules that could actually be synthesized (Pearson coefficient of 0.7 for the 73% testing molecules with “winning” syntheses).

Figure 4c,d compares the synthesis costs of the symmetric disconnection policy against that of the learned policy for both the training and testing molecules. The figure shows that the results are highly correlated (Pearson coefficient 0.84 and 0.86 for training and testing, respectively), indicating that the two policies make similar predictions. However, closer inspection reveals that the learned policy is systematically better than the heuristic as evidenced by the portion of the histogram below the diagonal (red line). For these molecules (42% and 31% of the training and testing sets, respectively), the learned policy identifies synthesis trees containing fewer reactions than those of the heuristic policy during single deterministic plays of the retrosynthesis game. By contrast, it is rare in both the training and testing molecules (about 4% and 11%, respectively) that the symmetric disconnection policy performs better than the learned policy. Additionally, the learned policy is more likely to succeed in identifying a viable synthesis plan leading to buyable substrates (Figure 4c). Of the 25,000 testing molecules, “winning” synthesis plans were identified for 73% using the learn policy as compared to 64% using the heuristic. These results suggest that the lessons gleaned from the training molecules can be used to improve the synthesis of new and unfamiliar molecules.

3 Conclusions

We have shown that reinforcement learning can be used to identify effective policies for the computational design of retrosynthetic pathways. In this approach, one specifies the global objective function to be minimized (here, the synthesis cost) without the need for ad hoc models or heuristics to guide local decisions during generation of the synthesis plan. Starting from a random policy, repeated plays of the retrosynthesis game are used to systematically improve performance in an iterative process that converges in time to an optimal policy. The learned value function provides a convenient estimate for the synthesis cost of any molecule, while the associated policy allows for rapid identification of the synthesis path. Importantly, the cost function can be easily adapted, for example to include specific costs for reactions and/or buyable substrates. Policy iteration using a different cost function will result in a different policy that reflects the newly specified objectives.

The chemical feasibility of synthetic pathways identified by the learned policy are largely determined by the quality of the reaction templates. The present templates are derived algorithmically from reaction precedents reported in the literature; however, an identical approach based on reinforcement learning could be applied using template libraries curated by human experts 7, 27

. Alternatively, it may be possible to forgo the use of reaction templates altogether in favor of machine learning approaches that suggest reaction precursors by other means.

28 Ideally, such predictions should be accompanied by recommendations regarding the desired conditions for performing each reaction in high yield.29, 30, 31, 32

In the present approach, the deterministic policy learned during training is applied only once to suggest one (near) optimal synthesis pathway. Additional pathways are readily generated, for example using Monte-Carlo Tree Search (MCTS) to bias subsequent searches away from previously identified pathways.18 A similar approach is used by Syntaurus, which relies on heuristic scoring functions to guide the generation of many possible synthesis plans, from which the “best” are selected. The main advantage of a strong learned policy is to direct such exploration more effectively towards these best syntheses, thereby reducing the computational cost of exploration.

We note, however, that the computational costs of training the learned policy are significant (ca.  several million CPU hours for the training in Figure 3). While the application of reaction templates remains the primary bottleneck (ca. 50%), the additional costs of computing ECFP fingerprints and evaluating the neural network were a close second (ca. 45%). These costs can be greatly reduced by using simple heuristics to generate synthetic pathways, from which stronger policies can be learned. We found that Eq. 4 performed remarkably well and was much faster to evaluate than the neural network. Such fast heuristics could be used as starting points for iterative policy improvement or as roll-out policies within MCTS-based learning algorithms.21 This approach is conceptually similar to the first iteration of AlphaGo introduced by DeepMind.33. Looking forward, we anticipate that the retrosynthesis game will soon follow the way of chess and go, in which self-taught algorithms consistently outperform human experts.

4 Methods

4.1 Target molecules

Training/testing sets of 95,774/23,945 molecules were selected from the Reaxys database on the basis of their structural diversity. Starting from more than 20 million molecules in the database, we excluded (i) those listed in the database of buyable compounds, (ii) those with SMILES strings shorter than 20 or longer than 100, and (iii) those with multiple fragments (i.e., molecules with ‘.’ in the SMILES string). The resulting million molecules were then aggregated using the Taylor-Butina (TB) algorithm34 to form million clusters, each comprised of “similar” molecules. Structural similarity between two molecules and was determined by the Tanimoto coefficient


where is the ECFP4 fingerprint for molecule .35 We used fingerprints of length 1024 and radius 3. Two molecules within a common cluster were required to have a Tanimoto coefficient of . The target molecules were chosen as the centroids of the largest clusters, each containing more than 20 molecules and together representing more than million molecules. These target molecules were partitioned at random to form the final sets for training and testing.

4.2 Buyable molecules

A molecule is defined to be a substrate if it is listed in the commercially available Sigma-Aldrich,23 eMolecules,24 or LabNetwork catalogs,25 and does not cost more than $100/g. The complete set of molecules in these catalogs with is denoted with . The molecules contained in each catalog can be downloaded by visiting the respective webpage.

4.3 Reaction templates

Given a molecule , we used a set of reaction templates to generate sets of possible precursors , which can be used to synthesize in one step. As detailed previously,36 these templates were extracted automatically from literature precedents and encoded using the SMARTS language. The application of the templates involves two main steps, substructure matching and bond rewiring, which were implemented using RDkit37. Briefly, we first search the molecule for a structural pattern specified by the template. For each match, the reaction template further specifies the breaking and making of bonds among the constituent atoms to produce the precursor molecule(s) . We used the RDChiral package38 to handle the creation, destruction, and preservation of chiral centers during the reaction.

The application of reaction templates to produce candidate reactions represents a major computational bottleneck in the retrosynthesis game due to the combinatorial complexity of substructure matching. Additionally, even when a template generates a successful match, it may fail to account for the larger molecular context resulting in undesired byproducts during the forward reaction. These two challenges can be partially alleviated by use of a “template prioritizer”,18 which takes as input a representation of the target molecule

and generates a probability distribution over the set of templates based on their likelihood of success. By focusing only on the most probable templates, the prioritizer can serve to improve both quality of the suggested reactions and the speed with which they are generated. In practice, we trained a neural network prioritizer on 5.4 million reaction examples from Reaxys and selected the top 99.5% of templates for each molecule

encountered. This filtering process drastically reduced the total number templates applied from 60,000 to less than 50 for most molecules. The training and validation details as well as the model architecture are available on Github39.

4.4 Policy Iteration

As noted in the main text, the depth constraint imposed on synthesis trees generated during the retrosynthesis requires some minor modifications to the value function of Eq. 2. The expected cost of synthesizing a molecule now depends on the residual depth as


where the first sum is over candidate reactions with as product, and the second is over all reactants associated with a reaction . For the present cost model, the expected cost increases with decreasing due to the increased likelihood of being penalized (to the extent ) for reaching the maximum depth ( such that ). Similarly, the -greedy policy used in policy improvement must also account for the residual depth at which a molecule is encountered


These recursive functions are fully specified by three terminating conditions introduced in the main text: (1) buyable molecule encountered, for ; (2) maximum depth reached, ; and (3) unmakeable molecule encountered, for .

4.5 Neural network architecture and training

We employed a multi-layer neural network illustrated schematically in Figure 5

. The 17 million model parameters were learned using gradient descent on training data generated by repeated plays of the retrosynthesis game. Training was performed using Keras with the Theano backend and the Adam optimizer with an initial learning rate of

, which decayed with the number of model updates as (13 updates were used to compute ). During each update, batches of 128 molecules and their computed average costs at a fixed

were selected from the most recent data and added to a replay buffer. Batches of equivalent size were randomly selected from the buffer and passed through the model for up to 100 epochs (1 epoch was taken as the total number of new data points having passed through the network). The mean-average error between the averaged (true) and predicted costs was used as the loss function. The latest model weights were then used as the policy for the next round of synthesis games.

Figure 5:

The neural model for the cost of molecules is a feed-forward neural network that accepts as input (green) an ECFP fingerprint of size 16384 extended to include the residual depth

of the molecule. The architecture includes one input later (blue) consisting of 1024 nodes, five hidden layers (red) each containing 300 nodes, and one output layer (purple) of size one plus a filter (also purple) that scales the initial output number to be within the range

. We also used batch normalization after each layer. The final output represents the estimated cost.

4.6 Additional results

Figure 6 shows the full cost distribution for the heuristic policy () and the learned optimal policy (). Figure 7a shows the probability of successfully synthesizing a molecule from the training set () versus . Figure 7b shows the average cost () of the training set versus . Figure 8 shows normalized probability distributions for the target costs in the training set for different values of . Figure 9 shows the 2D probability distribution computed with for the training set.

Train (100,00) Test (25,000)




Tie (






Tie (


0.4 35.0 47.9 6.1 21.3 47.9
3.9 5.2 5.1 3.4 8.6 9.9
1 1.6 1 1 1.1 1.3
Bulk 4.4 41.8 53.8 9.9 31.0 59.1
Table 2: Comparison of the performance of versus . The values show the percent that one policy found a lower cost than the other, or whether the two policies found pathways with identical cost (e.g.  a tie), for the molecules in the training and testing sets. All percents were computed using the size of the training (100,000) or testing set (25,000).
Figure 6: The distribution of expected costs, , over the set of 100,000 target molecules is shown for the greedy optimal policy and the symmetric disconnection policy for different values of .
Figure 7: (a) The probability of successfully synthesizing target molecules and (b) (bottom) for those synthesized are shown versus for the symmetric disconnection policy .
Figure 8: Normalized probability distributions are shown for different values of for the symmetric disconnection policy . Note that the columns in the plot each sum up to one.
Figure 9: A 2D histogram is shown for the symmetric disconnection policy .

5 Acknowledgments

This work was supported by the DARPA Make-It program under contract ARO W911NF-16-2-0023. We acknowledge computing resources from Columbia University’s Shared Research Computing Facility project, which is supported by NIH Research Facility Improvement Grant 1G20RR030893-01, and associated funds from the New York State Empire State Development, Division of Science Technology and Innovation (NYSTAR) Contract C090171, both awarded April 15, 2010.


  • Warr 2014 Warr, W. A. Mol. Inf. 2014, 33, 469–476.
  • Engkvist et al. 2018 Engkvist, O.; Norrby, P.-O.; Selmi, N.; Lam, Y.-h.; Peng, Z.; Sherer, E. C.; Amberg, W.; Erhard, T.; Smyth, L. A. Drug Discovery Today 2018,
  • Coley et al. 2018 Coley, C. W.; Green, W. H.; Jensen, K. F. Acc. Chem. Res. 2018, 51, 1281–1289.
  • Silver et al. 2017 Silver, D. et al. Nature 2017, 550, 354.
  • Silver et al. 2018 Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T.; Simonyan, K.; Hassabis, D. Science 2018, 362, 1140–1144.
  • Kowalik et al. 2012 Kowalik, M.; Gothard, C. M.; Drews, A. M.; Gothard, N. A.; Weckiewicz, A.; Fuller, P. E.; Grzybowski, B. A.; Bishop, K. J. Angew. Chem. Int. Ed. 2012, 51, 7928–7932.
  • Szymkuć et al. 2016 Szymkuć, S.; Gajewska, E. P.; Klucznik, T.; Molga, K.; Dittwald, P.; Startek, M.; Bajczyk, M.; Grzybowski, B. A. Angew. Chem. Int. Ed. 2016, 55, 5904–5937.
  • Law et al. 2009 Law, J.; Zsoldos, Z.; Simon, A.; Reid, D.; Liu, Y.; Khew, S. Y.; Johnson, A. P.; Major, S.; Wade, R. A.; Ando, H. Y. J. Chem. Inf. Model. 2009, 49, 593–602.
  • Christ et al. 2012 Christ, C. D.; Zentgraf, M.; Kriegl, J. M. J. Chem. Inf. Model. 2012, 52, 1745–1756.
  • Bøgevig et al. 2015 Bøgevig, A.; Federsel, H.-J.; Huerta, F.; Hutchings, M. G.; Kraut, H.; Langer, T.; Löw, P.; Oppawsky, C.; Rein, T.; Saller, H. Org. Process Res. Dev. 2015, 19, 357–368.
  • Grzybowski et al. 2009 Grzybowski, B. A.; Bishop, K. J.; Kowalczyk, B.; Wilmer, C. E. Nat. Chem. 2009, 1, 31.
  • Bertz 1981 Bertz, S. H. J. Am. Chem. Soc. 1981, 103, 3599–3601.
  • Sheridan et al. 2014 Sheridan, R. P.; Zorn, N.; Sherer, E. C.; Campeau, L.-C.; Chang, C.; Cumming, J.; Maddess, M. L.; Nantermet, P. G.; Sinz, C. J.; O’Shea, P. D. J. Chem. Inf. Model. 2014, 54, 1604–1616.
  • Coley et al. 2018 Coley, C. W.; Rogers, L.; Green, W. H.; Jensen, K. F. J. Chem. Inf. Model. 2018, 58, 252–261.
  • Weininger 1988 Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
  • Ertl and Schuffenhauer 2009 Ertl, P.; Schuffenhauer, A. J. Cheminformatics 2009, 1, 8.
  • Coley et al. 2017 Coley, C. W.; Rogers, L.; Green, W. H.; Jensen, K. F. ACS Central Science 2017, 3, 1237–1245.
  • Segler and Waller 2017 Segler, M. H.; Waller, M. P. Chem. Eur. J. 2017, 23, 5966–5971.
  • Segler et al. 2018 Segler, M. H.; Preuss, M.; Waller, M. P. Nature 2018, 555, 604.
  • Corey 1991 Corey, E. J. The logic of chemical synthesis; John Wiley & Sons, 1991.
  • Sutton and Barto 2017 Sutton, R.; Barto, A. Reinforcement Learning: An Introduction, second edition ed.; MIT Press, 2017.
  • Mnih et al. 2015 Mnih, V. et al. Nature 2015, 518, 529.
  • 23 Sigma-Aldrich, Inc. https://www.sigmaaldrich.com, Accessed: 2018-12-17.
  • 24 E-molecules. https://www.emolecules.com/info/plus/download-database, Accessed: 2018-12-17.
  • 25 LabNetwork Collections. https://www.labnetwork.com/frontend-app/p/#!/screening-sets, Accessed: 2018-12-17.
  • Rogers and Hahn 2010 Rogers, D.; Hahn, M. J. Chem. Inf. Model. 2010, 50, 742–754.
  • Klucznik et al. 2018 Klucznik, T. et al. Chem 2018, 4, 522–532.
  • Liu et al. 2017 Liu, B.; Ramsundar, B.; Kawthekar, P.; Shi, J.; Gomes, J.; Luu Nguyen, Q.; Ho, S.; Sloane, J.; Wender, P.; Pande, V. ACS Cent. Sci. 2017, 3, 1103–1113.
  • Marcou et al. 2015 Marcou, G.; Aires de Sousa, J.; Latino, D. A.; de Luca, A.; Horvath, D.; Rietsch, V.; Varnek, A. J. Chem. Inf. Model. 2015, 55, 239–250.
  • Lin et al. 2016 Lin, A. I.; Madzhidov, T. I.; Klimchuk, O.; Nugmanov, R. I.; Antipin, I. S.; Varnek, A. J. Chem. Inf. Model. 2016, 56, 2140–2148.
  • Segler and Waller 2017 Segler, M. H.; Waller, M. P. Chem. Eur. J. 2017, 23, 6118–6128.
  • Gao et al. 2018 Gao, H.; Struble, T. J.; Coley, C. W.; Wang, Y.; Green, W. H.; Jensen, K. F. ACS Central Science 2018, 4, 1465–1476.
  • Silver et al. 2016 Silver, D. et al. Nature 2016, 529, 484.
  • Butina 1999 Butina, D. J. Chem. Inf. Comput. Sci. 1999, 39, 747–750.
  • Glen et al. 2006 Glen, R. C.; Bender, A.; Arnby, C. H.; Carlsson, L.; Boyer, S.; Smith, J. IDrugs 2006, 9, 199.
  • Coley et al. 2017 Coley, C. W.; Barzilay, R.; Jaakkola, T. S.; Green, W. H.; Jensen, K. F. ACS Cent. Sci. 2017, 3, 434–443.
  • 37 RDKit: Open-source cheminformatics. http://www.rdkit.org, Accessed: 2018-12-17.
  • 38 Coley, C. W. RDChiral. https://github.com/connorcoley/rdchiral, Accessed: 2018-07-18.
  • 39 Coley, C. W. Retrotemp. https://github.com/connorcoley/retrotemp/tree/master/retrotemp, Accessed: 2018-12-17.