A Bayesian algorithm for retrosynthesis

03/06/2020 ∙ by Zhongliang Guo, et al. ∙ 0

The identification of synthetic routes that end with a desired product has been an inherently time-consuming process that is largely dependent on expert knowledge regarding a limited fraction of the entire reaction space. At present, emerging machine-learning technologies are overturning the process of retrosynthetic planning. The objective of this study is to discover synthetic routes backwardly from a given desired molecule to commercially available compounds. The problem is reduced to a combinatorial optimization task with the solution space subject to the combinatorial complexity of all possible pairs of purchasable reactants. We address this issue within the framework of Bayesian inference and computation. The workflow consists of two steps: a deep neural network is trained that forwardly predicts a product of the given reactants with a high level of accuracy, following which this forward model is inverted into the backward one via Bayes' law of conditional probability. Using the backward model, a diverse set of highly probable reaction sequences ending with a given synthetic target is exhaustively explored using a Monte Carlo search algorithm. The Bayesian retrosynthesis algorithm could successfully rediscover 80.3 within top-10 accuracy, respectively, thereby outperforming state-of-the-art algorithms in terms of the overall accuracy. Remarkably, the Monte Carlo method, which was specifically designed for the presence of diverse multiple routes, often revealed a ranked list of hundreds of reaction routes to the same synthetic target. We investigated the potential applicability of such diverse candidates based on expert knowledge from synthetic organic chemistry.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The objective of retrosynthetic planning is to design a synthetic route backwardly from a given desired molecule to commercially available starting materials. In 1969, Corey and Wipke introduced the first computer-aided synthesis planning program, known as Organic Chemical Simulation of Synthesis Corey1969Computer-AssistedSyntheses, which has continued to grow as LHASA PENSAK1977, SYNCHEM Gelernter1977EmpiricalSYNCHEM and WODCA Gasteiger2000Computer-assistedChemistry, among others. Such early retrosynthetic systems relied on hand-coded reaction rules or those algorithmically extracted from reaction databases. The applicability of a feasible reaction rule to a target product is assessed based on the presence of local structural or atomic features around the candidate reaction site in the rule set. Knowledge-based or, more recently, machine-learning algorithms have been used to judge which rule to select. Such prioritized transformation, for example, a rule to break bonds, is recursively applied to the current molecules to derive structurally simpler precursors until the growing retrosynthetic tree reaches readily available substrates.

Rule-based systems are interpolative by nature. Their predictability is no longer applicable if the underlying reaction mechanisms extend beyond the synthetic knowledge encoded in the rule set. To cover a broader reaction space, various deep-learning techniques have been introduced to retrosynthetic analyses in recent years, which can be categorized into rule-based two-step models Segler2017Neural, Coley2017Computer-AssistedSimilarity and fully data-driven end-to-end analyses Liu2017RetrosyntheticModels, Zheng2019PredictingNetworks, Lin2019AutomaticModels

. Their common goal is to identify the inverse mapping of a synthetic reaction from any given product to its unknown reactants using millions of training examples from known reactions. The former methods divide the task into two separate steps: the use of a trained machine-learning model that classifies an input product into one of dozens or hundreds of pre-compiled reaction templates, followed by the application of rules that are more likely to occur to deconvolve the target product retrosynthetically into simpler precursors

Segler2017Neural, Coley2017Computer-AssistedSimilarity

. The ability to predict the reaction outcomes has been significantly improved compared to earlier knowledge-based or logic-based models, owing to the high coverage of training reaction instances. However, these models still rely on reaction rules, resulting in limitations in the coverage of the predictable reaction spaces. To broaden the search space further, fully end-to-end approaches based on state-of-the-art neural machine translation systems have been developed, such as seq2seq

Liu2017RetrosyntheticModels and Molecular Transformer Schwaller2019MolecularPrediction. Once the chemical structures of products and reactants have been encoded by the Simplified Molecular Input Line Entry Specification (SMILES) chemical language Weininger1988SMILESRules, the task of predicting the retrosynthetic outcomes amounts to determining the rule of conversion from character-encoded products to character-encoded reactants. With a given synthetic target, such backward reaction prediction models can be used to generate a branched sequence of reactants recursively, until the growing retrosynthetic tree reaches a prescribed set of purchasable compounds. Several general-purpose best-first search algorithms can be used to identify chemically plausible synthetic routes from a potentially infinitely large search tree effectively, such as Monte Carlo tree search Segler2018PlanningAI.

It should be noted that the majority of candidate reactants simulated from such backward prediction models are rarely contained within a given set of purchasable compounds that span the feasible solution space. This is a major drawback of existing retrosynthetic methods. Another shortcoming arises from the low accuracy of backward reaction models. One reason for this is the loss of information in reaction data; the side products of synthetic reactions are often unrecorded in the datasets. To predict the reactants in a backward manner, it is necessary to reconstruct the missing structure of their product, which is impossible to achieve without using additional knowledge on structural transformation. As summarized in Table 1, the previously reported prediction accuracy ranged from 35% to 45%. This implies that a large number of candidate reactants in a simulated reaction sequence would be commercially unobtainable.

However, the forward prediction of reaction outcomes has achieved substantially higher levels of accuracy than those of retrosynthetic prediction Coley2017PredictionLearning, Jin2017PredictingNetwork, Schwaller2019MolecularPrediction. As summarized in Table 1, the reported prediction accuracy of state-of-the-art models is 90.4% for Molecular Transformer, which is twice the accuracy of backward prediction. The key concept of this study is rather obvious. A trained forward model with such high accuracy defines the mapping from a set of reactants to their product . By solving the inverse mapping with a synthetic target with respect to a pre-defined pool of commercially available reactants, there is a chance that the resulting retrosynthetic prediction will reach the same level of accuracy as that of the forward prediction model. The problem to be solved is a combinatorial optimization problem with the solution space subject to the combinatorial complexity of all possible pairs of purchasable reactants in the catalog. The complexity increases exponentially with the size of the candidate reactants, as well as the number of reaction steps considered.

In this study, we addressed the task of reaction mining within the framework of Bayesian inference, namely Bayesian retrosynthesis, which provides a principled means of inverting any given forward models into the retrosynthetic prediction system. To enhance the search efficiency and exhaustively enumerate alternative pathways, we developed a sequential Monte Carlo (SMC) algorithm using a surrogate model accelerator (see Figure 1 for a schematic view).

The Bayesian retrosynthesis algorithm successfully rediscovered 80.3% and 50.0% of known synthetic routes of single-step and two-step reactions within top-10 accuracy, respectively. Remarkably, the Monte Carlo method, which was specifically designed for the presence of diverse multiple routes, revealed a prioritized list of more than 400 reaction routes on average for each target. We investigated the potential applicability of such diverse candidates based on expert knowledge from synthetic organic chemistry. A Python implementation is available on GitHub Guo2020BayesianRetro.

Figure 1: Workflow of Bayesian retrosynthesis algorithm. A synthetic reaction model that forwardly predicts a product of any given reactants is inverted into the backward model through Bayes’ law of conditional probability. Monte Carlo sampling from the posterior distribution conditioned by a desired product provides a diverse set of highly probable reactant pairs, , which meet the requirement .
Task Model top-1 top-3 top-5 top-10
Backward similarity (Coley et al. 2017) Coley2017Computer-AssistedSimilarity 37.3 54.7 63.3 74.1
SCROP (Zheng et al. 2019) Zheng2019PredictingNetworks 43.7 60.0 65.2 68.7
Lin et al. 2019 Lin2019AutomaticModels 43.1 64.6 71.8 78.7
Forward template-based (Coley et al. 2017) Coley2017PredictionLearning 71.8 86.7 90.8 94.6
WLDN (Jin et al. 2017) Jin2017PredictingNetwork 79.6 87.7 89.2 -
Molecular Transformer (Schwaller et al. 2019) Schwaller2019MolecularPrediction 90.4 94.6 95.3 -
Table 1: Performance of existing deep neural networks on prediction of synthetic reactions in forward and backward manners. The top-1, top-3, top-5, and top-10 accuracies in [%] are presented for each.

2 Methods

2.1 Outline

A single-step reaction prediction model is a function that describes a product as a function of a set of reactants, denoted by . Using such a model, we can simulate a single-step reaction for any as . Herein, the function is treated as being deterministic. Solvents, reagents, and catalysts can be augmented into the input variable as required.

Likewise, a -step reaction sequence ending with a final product can be generated by convoluting the single-step model times with arbitrarily selected reactant sets at each step:

where denotes an intermediate product that is produced from the product at the previous step and the currently selected reactants . In general, the reaction prediction can be represented as , with all reactants concatenated into a single input symbol as .

The ultimate goal of the retrosynthetic prediction is to enumerate all possible satisfying for a given synthetic target , or equivalently, to solve the inverse mapping of the forward model . The solution space is composed of all reactant combinations in the purchasable compounds. The number of candidate reactants is typically of the order , resulting in the cardinality of the solution space being exploded to , in which reactants are placed in the synthetic route planning.

In certain cases, we aim to identify an ensemble of that meets the requirement approximately as , rather than obtaining the strict solution. Firstly, it is possible that all candidate reactants will never reach the target product with the currently provided model . Furthermore, even when the model is incorrect, true reactants are expected to be near the optimal solutions . In such scenarios, it is more beneficial to view the distribution of as approximately satisfying than to obtain only the strict solution. This is the underlying concept for handling the retrosynthetic analysis within the Bayesian framework.

The Bayesian retrosynthesis relies on Bayes’ law of conditional probability:

This law states that the posterior probability distribution

is proportional to the product of the likelihood and prior . The forward prediction model forms the likelihood function, which is the Boltzmann distribution with an inverse temperature , as follows:

The energy function is provided by the Euclidean distance or Tanimoto distance tanimoto1958elementary between the chemical structures of the target and predicted product . The distance is calculated with the extended-connectivity fingerprint (ECFP)Rogers2010Extended-connectivityFingerprints

with diameter 4 using the open-source cheminformatics toolkit RDKit

landrum2006rdkit in Python. Through the prior distribution, it is possible to place a low probability mass on unlikely reactants to narrow down the enormous solution space. We assigned equal probabilities to all candidates throughout this study so that for all .

The posterior is a discrete probability distribution , where denotes the indicator function that takes the value 1 if , and 0 otherwise. The support of this discrete measure consists of all possible combinations of reactants involved in a synthetic route. As the exact computation across candidates is generally infeasible, we explore the approximated form as

(1)

The primary objective in the Bayesian computation is to identify the reduced set of reactant pairs, , possibly with . Candidates with greater likelihoods should have a better chance of surviving, while ignorable with low likelihood values are effectively eliminated.

2.2 Difficulties of ordinal heuristic algorithms

Sampling from the posterior distribution over the huge discrete space is a considerably difficult task. Indeed, a conventional heuristic algorithm is no longer applicable to solve this task, as demonstrated in this study (see Supporting Information and Table S6). To observe this, we first present the simplest form of the sequential Monte Carlo (SMC) algorithm

Moral2006SequentialSamplers

. The SMC has a common algorithmic structure with genetic algorithms, which has been applied in various machine-learning or signal-processing problems. It begins with an arbitrary set of

particles, , which represents an ensemble of candidate reactant pairs, followed by the iteration of two operations, hereafter referred to as sampling and resampling.

1:Initialize a set of particles .
2:for  do
3:     for  do
4:         Generate a new particle according to a proposal distribution .
5:         Evaluate the importance weight according to the likelihood function
6:     end for
7:     Resample with the probability proportional to to obtain a new set .
8:end for
Algorithm 1 Simple SMC

To reveal the technical difficulties, we consider a simple SMC, as indicated in Algorithm 1. For a given particle set at step , a particle is tentatively replaced by a new according to a prescribed proposal distribution . Specifically, we sample with equal probability from the -nearest neighbors of in the candidate set of commercial reactants. For the neighbor search, all compounds in and reactants in are fingerprinted by ECFP with diameter 4, followed by the calculation of the Tanimoto distance. For each candidate particle, the goodness-of-fit to the synthetic target, which is referred to as the importance weight, is evaluated by simulating the product using Molecular Transformer. It should be noted that the importance weight is reduced to when using the -nearest neighbor proposal . Resampling of is then carried out based on the selection probability proportional to , which yields a new particle set . For a given , as the predicted product moves closer to the given synthetic target, its reactant set has a greater chance of surviving in the resampling step. By repeating the operation of sampling and resampling times, we obtain samples in with the likelihood , which form the approximated posterior (Eq. 1).

As described in Supporting Information, this simple SMC performed exceptionally poorly, as it failed to discover approximately 50% of known reactions in the performance test (Table S6), owing to two essential drawbacks that are common to ordinal heuristic search methods, including the genetic algorithm. One difficulty arose from the underlying diversity of the solutions. Our analysis demonstrated that a large number of synthetic routes ended with the same product in the search space. Using an extended algorithm, more than 400 different routes were identified for the synthetic targets on average. In general, it is difficult to identify a diverse set of highly probable solutions comprehensively using ordinal heuristic methods. In the SMC workflow, the step of repeated resampling induces a rapid loss of diversity among the particle sets, which is known as the problem of particle impoverishment Stavropoulos2001ImprovedSmoothing. Another issue of the ordinal methods arose from the cost of the repeated calculations of reaction prediction models. On average, the single-step reaction prediction using Molecular Transformer required 30 to 40 seconds on a Linux server with a NVIDIA V100 GPU or P100 GPU, thereby leading to a total execution time of over 7 hours under and . In summary, a surrogate model-assisted Monte Carlo algorithm was required to save the costs of repeatedly evaluating the computationally expensive reaction prediction model while maintaining the diversity of the particles, including highly probable solutions.

2.3 SMC accelerated by surrogate likelihood

A key concept behind our strategy is that, by using a computationally inexpensive surrogate model, such as gradient boosting regression

Schapire2012BoostingAlgorithms, we can approximately evaluate the likelihood function of Molecular Transformer for any given reactants with a high degree of accuracy. For instances of reactants, the likelihood values () are calculated using Molecular Transformer with any given target . Using this dataset, we train a gradient boosting regression tree , which can be used to predict the likelihood of without passing through Molecular Transformer (Figure 2a). In this study, a gradient boosting model was pre-trained on the training dataset. The ECFPs in the RDKit package were used as inputs for the gradient boosting regression based on the LightGBM package Ke2017LightGBM:Tree in Python. Figure 2b presents the performance of surrogate models for two different target molecules. The true reactants were scored as the most likely for each target molecule among the test data. The high correlation of the predicted and true likelihood made the surrogate likelihood a reliable alternative for the true likelihood, the calculation of which was expensive.

Figure 2: a. The energies (negative log-likelihood) predicted by the surrogate model are used to prioritize promising reactants before conducting expensive reaction prediction using Molecular Transformer. b. For two synthetic targets, the energies predicted by the surrogate model are displayed against the true values of 5k reactions in the test dataset. The orange points denote the ground-truth reactants. The dashed lines indicate the predicted values of the ground-truth reactants. The target molecules are displayed in the top right corners.

The proposed retrosynthesis algorithm is described in Algorithm 2. It relies on the evolutionary strategy involving the sampling and resampling operations, as in Algorithm 1. Three modifications are included in the sampling step to maintain the diversity of the final candidates. Firstly, the currently provided top- particles are expanded into particles by taking the -nearest neighbors of each candidate (line 7). This operation aims to increase the heterogeneity of the sample population in the following generation. The fast-to-evaluate surrogate model can efficiently prioritize increasingly heterogeneous particles according to the surrogate likelihoods (line 8). Moreover, the particles are clustered into subgroups according to the fingerprints (ECFP with diameter 4) of the chemical structures (line 9). The surrogate likelihoods are transformed into the cluster-level goodness-of-fit scores by taking the within-cluster average (line 10). As a result, a new set of particles is created by performing resampling of the particles, such that the number of particles in each cluster is proportional to the within-cluster likelihood (line 11). This cluster-level resampling is key to maintaining the population diversity. Finally, a set of randomly selected candidate reactants is augmented to the new set of particles (line 12). After calculating the true likelihoods of the surviving particles using Molecular Transformer, we proceed with the selection of the top- particles.

1:Initialize a set of particles: with likelihood .
2:Initialize the active candidate set , which consists of the indices of all possible reactant pairs.
3:Generate a training set to construct a surrogate likelihood function, in which Molecular Transformer is used to provide the likelihood for any .
4:Build a pre-trained model (gradient boosting) on as the surrogate likelihood .
5:for  do
6:     Select the top of particles from according to the descending order of likelihood values in .
7:     Generate particles by taking the -nearest neighbors of each () selected from .
8:     Calculate the surrogate likelihood of .
9:     By performing -means clustering with the fingerprints of , the temporally proposed particles are grouped into non-overlapping clusters .
10:     Calculate the cluster-level surrogate likelihood as .
11:     Resample particles from , denoted by , such that the number of particles in each is proportional to .
12:     Generate the remaining particles, , with equal probabilities, from the entries in .
13:     Set the new particles as .
14:     Calculate the likelihood of all entries in .
15:     The indices of the current particles are removed from the active candidate set .
16:end for
17:Return and () to calculate the approximate posterior Eq. 1.
Algorithm 2 Surrogate-assisted SMC

2.4 Ranking and prioritization

In this study, we demonstrated that the SMC algorithm generally discovers an excessive number of hypothetical routes to a given product; in many cases, several hundreds or more for single-step reactions. The majority of such candidates are chemically unrealistic and false discoveries, which possibly results from the unavailability of failed reactions or data of low-yielding reaction in the machine-learning workflow. Indeed, our expert chemists recognized 60% of the detected reactant pairs as having exceedingly low or no reactivity (Table S9). Here, we consider two means of prioritizing more promising candidates by using a heuristic ranking method or reaction-type grouping.

The ranking method scores a given pair of reactants as

The first term represents the probability of the tokenized SMILES symbols, which is provided by Molecular Transformer. In the second term, indicates the probability of belonging to a prescribed known reaction class . In this study, we consider 10 reaction classes defined by Schneider and coworkers Schneider2016WhatsAssignment, as described in Table S1. A total of 50k reactions in the United States Patent and Trademark Office (USPTO) dataset DanielMarkLowe2012ExtractionLiterature

were categorized into the 10 reaction classes. A sparse logistic regression model

Friedman2009Glmnet:Models was trained on a randomly selected 80% portion of the dataset, in which the ECFPs of and were calculated with diameter 4 and concatenated to obtain a descriptor. The prediction accuracy reached 97.3% on the test set. The trained model was used to evaluate and the most probable reaction class was selected to define the score. The underlying concept of using the hand-designed heuristic score was that candidate reactions that were highly discriminable into one of the known reaction classes were expected to be reliable.

Another method to choose the promising candidates is based on the grouping of reaction patterns. In this study, t-SNE VanDerMaaten2008VisualizingT-SNE was performed on the augmented fingerprints of all given , and X-means clustering DauPelleg2000X-means:2000 was used to automate the determination of the number of clusters and grouping of reaction patterns. For each cluster, we selected a representative reaction that exhibited the best within-cluster score. In this manner, we could infer how many different types of synthetic routes potentially existed or were feasible to design with a given set of purchasable compounds.

3 Results

3.1 Data

We used a collection of 50k single-step reactions Schneider2016WhatsAssignment that were extracted from nine million patent applications and issued patents from 1976 to 2016. This dataset has served as a benchmark set in existing studies to evaluate retrosynthesis methodsLiu2017RetrosyntheticModels, Coley2017Computer-AssistedSimilarity, Zheng2019PredictingNetworks, Lin2019AutomaticModels. In this study, the dataset was used to train a forward model and to create the ground-truth sets of the single-step and two-step reactions, as described below. Following a previous studyLiu2017RetrosyntheticModels, we split the dataset into 80% for training, 10% for validation, and 10% for testing. Note that the recorded reactions were classified into 10 reaction classes according to an expert system Schneider2016WhatsAssignment (Table S1). While most existing methods have employed one of the pre-defined reaction classes as input for the retrosynthetic prediction system to narrow it down to a limited solution space, we performed the Bayesian algorithm with no such prior knowledge. The solution space that we considered hypothetically as a pool of purchasable compounds was spanned by all possible combinations of approximately 600k reactants in the USPTO dataset.

3.2 Forward prediction model

We used a pre-trained Molecular Transformer Schwaller2019MolecularPrediction, which was trained on a dataset from USPTO. This attention-based neural translation model defined a translation between the SMILES strings of reactants and their product. For the sake of simplicity, the reagents were removed from the input. Any number of reactants, which were separated by “.”, took part in the input. All of the reactions were canonicalized using RDKit. The inputs were tokenized with the regular expression according to a previous study Schwaller2019MolecularPrediction. With the direct application to our test set, the distributed pre-trained model achieved top-1 accuracy of 70.7% and top-5 accuracy of 86.4%. However, with a fine-tuned model using the training and validation data, the top-1 accuracy reached 86.9% and the top-5 accuracy reached 95.5%.

3.3 Implementation

The Bayesian retrosynthesis algorithm was implemented in Python (version 3.6.8), coupled with RDKit and sckit-learn. Molecular Transformer built using PyTorch was plugged into the forward model

Schwaller2019MolecularPrediction. All experiments were run on the AI Bridging Infrastructure (ABCI) at National Institute of Advanced Industrial Science and Technology. The computation was performed on executions with 25 nodes and 100 NVIDIA Tesla V100 devices.

3.4 One-step retrosynthesis

For the one-step retrosynthesis prediction, we randomly selected 100 test reactions from the test set consisting of one or two reactants (all of the reaction SMILES are presented in Table S7). Among the 100 reactions, Molecular Transformer could predict the true products to be top-1 candidates for 87 reactions. For each reaction tested, we performed the Bayesian retrosynthesis algorithm 10 times to evaluate the average detection rate. The number of particles was set to 1,000. In the first 100 steps of the SMC, each particle consisted of one reactant to explore the one-reactant reaction space. The reaction space of two reactants was explored in the subsequent 500 steps, in which a combination of two reactants constituted a particle. The SMC algorithm was used to perform a total of 600,000 () evaluations of the forward reaction models in each test reaction, which corresponded to approximately 0.0001% of the entire search space. The Bayesian retrosynthesis algorithm identified multiple synthetic routes ending with 98.4% of the 100 target molecules. Ground-truth reactants were found in 88.3% of all test reactions. Focusing on the 87 reactions in which Molecular Transformer was predictable, the ground-truth reactants were found with a 94.5% success rate (Table 2).

Number of
reactions
Detection of reactants ending
with target product [%]
Inclusion of ground-truth
reactants [%]
Random100 100 98.4 88.3
MT-predictable 87 99.1 94.5
Table 2: Performance of surrogate-accelerated SMC on randomly selected 100 ground-truth reactions (“Random100”). “MT-predictable” denotes the subset (87 reactions) in which Molecular Transformer could forwardly predict their products as top-1 candidates. The average success rate for detecting one or more synthetic routes ending with each target product is indicated in the third column. The fourth column denotes the average rate at which the ground-truth reactions were included in all detected routes.

In this experiment, the Bayesian retrosynthesis algorithm revealed more than 400 synthetic routes on average for each design target. Figure 3 presents the diversity of the detected reaction routes on two synthetic targets. The ECFPs of reactants in were reduced to an augmented descriptor as , and these were embedded in the two-dimensional subspace for visualization using the t-SNE algorithm. The distribution of the projected reactants indicates that there would be multiple motifs of the synthetic routes to the same product. The comprehensive detection of the candidate synthetic routes and their visualization-based summary may be helpful for facilitating chemists’ creativity and decision-making in synthetic route design.

It is possible that most of such large amounts of candidate routes would be false findings. To narrow down the candidates, we performed the ranking procedure as described in the Methods section. For each of the 100 ground-truth reactions, we obtained a ranked list of the top- most probable candidates and investigated whether to include the ground-truth reactants. As summarized in Table 3, the top- accuracy of the Bayesian retrosynthesis outperformed the state-of-the-art models for any ; in particular, 77.0% and 80.3% of the top-5 and top-10 accuracies were reached, respectively. Within the 87 ground-truth reactions in which Molecular Transformer successfully predicted the products in the forward manner, the top-5 and top-10 accuracies reached 86.0% and 89.4%, respectively.

According to the comparison presented in Table 3, this result also outperformed other existing methods using the reaction class explicitly as an extra input in the retrosynthetic prediction. With a given reaction class, such prediction models can narrow down the reaction space into a focused region to enhance the prediction performance. However, such prior knowledge is rarely available in real applications. As a reference, we performed a modified Bayesian retrosynthesis. Given a true reaction class and its class probability , we calculated the ranking score as instead of taking the maximum with respect to the classification probabilities, as shown in the Methods section. The top-5 and top-10 accuracies were further improved to 81.4% and 83.5%, respectively, for the 100 ground-truth reactions (Table 3).

Data Model top-1 top-3 top-5 top-10
Without reaction class similarity (Coley et al. 2017) Coley2017Computer-AssistedSimilarity 37.3 54.7 63.3 74.1
SCROP (Zheng et al. 2019) Zheng2019PredictingNetworks 43.7 60.0 65.2 68.7
Lin et al. 2020 Lin2019AutomaticModels 43.1 64.6 71.8 78.7
Bayesian-Retro 47.5 67.2 77.0 80.3
2-6 Bayesian-Retro (MT-predictable) 54.6 74.9 86.0 89.4
With reaction class baseline (Liu et al. 2017) Liu2017RetrosyntheticModels 35.4 52.3 59.1 65.1
seq2seq (Liu et al. 2017) Liu2017RetrosyntheticModels 37.4 52.4 57.0 61.7
similarity (Coley et al. 2017) Coley2017Computer-AssistedSimilarity 52.9 73.8 81.2 88.1
SCROP (Zheng et al. 2019) Zheng2019PredictingNetworks 59.0 74.8 78.1 81.1
Lin et al. 2020 Lin2019AutomaticModels 54.6 74.8 80.2 84.9
Bayesian-Retro 55.2 74.1 81.4 83.5
2-6 Bayesian-Retro (MT-predictable) 62.8 81.8 89.7 91.7
Table 3: Performance of various retrosynthetic prediction methods with or without reaction class labels. “Bayesian-Retro” denotes the proposed method; “MT-predictable” denotes the performance on the 87 ground-truth reactions with their products forwardly predictable by Molecular Transformer. The top-1, top-3, top-5, and top-10 accuracies in [%] are indicated for each case.
Figure 3: Distributions of candidate reactants for two synthetic targets (a and b) visualized based on projection to 2D space using t-SNE. a. The t-SNE projection of 3,559 candidate reactants ending with the target molecule (indicated in the top left corner) is shown. Here, denotes the ground truth reaction. In the left panel, the data points are color-coded according to the ranking scores normalized to . In the right panel, the X-means clustering of candidate reactants fingerprinted by ECFP. The optimal number of clusters is determined as 29. The cluster memberships are indicated with different colors in the t-SNE plot. b The t-SNE projection and X-means clustering of the other target molecule. A total of 2,236 candidate reactants are grouped into the 36 clusters.

3.5 Multi-step retrosynthesis

The Bayesian retrosynthesis algorithm was applied to the design of two-step synthetic routes. By combining the single-step reactions that were predictable with Molecular Transformer in the test reaction set, we generated a ground-truth set of two-step reactions as follows: If a product of a recorded reaction appeared in a different reaction as a reactant, the two reactions were connected to form a two-step synthetic route, as the product of the former reaction was involved as a reactant of the second-step reaction as an intermediate product. In this manner, we obtained 11 two-step reactions. Our expert chemists verified the validity of these reactions, as summarized in Table 4 and Table S8. According to their evaluations, in which unrecorded reagents and reaction conditions were inferred based on expert knowledge, the first step in reaction 3 was judged as chemically unrealistic. In this case, instead of excluding reaction 3 from the ground-truth set, we tested whether the Bayesian retrosynthesis could determine alternative synthetic routes to the target product.

The Bayesian retrosynthesis algorithm was performed 10 times for each reaction. The number of particles was set to 2,000, and each particle consisted of two and one reactants in the first and second reactions, respectively. The number of steps in the SMC was set to 1,000, constituting a total of searches in each test case. This number corresponded to approximately of the entire search space. In 9 of the 11 reactions, the recorded reactants could successfully be identified at least once among the 10 repeated tests. More than 11,000 candidate routes were identified on average for each target product. Hence, we aggregated all of the candidate synthetic routes and performed the ranking procedure. The recorded reactions were ranked as the top 10 candidates in five cases and as the top 100 in the other cases (Table 4).

To observe the distribution of the candidate synthetic routes, the t-SNE projection of the detected reactants for reaction 9 is presented in Figure 4a. Candidate reactants closer to the recorded reactants exhibited higher scores. To identify the different motifs in the candidate synthetic routes, X-means clustering was applied to the ECFPs of the reactants, which were grouped into 98 clusters. We investigated the synthesis feasibility of the 10 reactants exhibiting the highest score in each of the 10 different clusters (Figure 4a). According to the evaluations by the expert chemists, 7 out of the 10 proposed routes would be chemically reactive and synthesizable. In candidate routes 1 to 3, the first and second steps were known as Williamson ether synthesis QJ8520400229 and a palladium-catalyzed coupling reaction Miyaura1995Palladium-CatalyzedCompounds, respectively. In candidate route 9, the second step consisted of ether synthesis. It should be stressed that ranking only by the score does not always reveal such promising synthetic routes. It is important to extract a diverse set of candidates based on the clustering procedure to enhance chemists’ creativity.

For ground-truth reaction 3, which the chemists judged to be chemically infeasible, the Bayesian retrosynthesis algorithm identified 2,280 alternative synthetic routes to the target molecule (Figure 5). Based on our ranking and clustering procedures, 10 synthetic routes were selected and two reactions with different side-chain modifications were judged as reactive and synthesizable according to chemists’ evaluations (Figure 5b). Moreover, a ring-forming synthesis was proposed by the algorithm (the route indicated at the top of Figure 5c). Although the orthoformic acid monoester used in the first step was chemically unstable, the proposed synthetic route could be helpful for chemists to consider the strategy of designing alternative synthetic routes. Indeed, a different ring-forming synthetic route was manually designed by using formaldehyde instead of the orthoformic acid monoester.

No. Reaction Step 1 Step 2
No. of
candidates
Ground
truth
Rank
1 1 1 784 F -
2 1 1 1,326 T 2
3 3 1 2,280 F -
4 1 1 18,197 T 5
5 1 1 28,388 T 94
Table 4: Ground-truth set consisting of two-step synthetic routes to 11 targets. The score , which is assigned to each reaction step, indicates ‘feasible,’ ‘contentious’ or ‘infeasible,’ as judged by the chemists. The Bayesian retrosynthesis was performed 10 times for each reaction. The fifth column denotes the total number of candidate routes ending with the target molecule in the 10 trials. T and F denote the presence or absence of the recorded reaction in the detected synthetic routes. The final column indicates the rank of the ground truth predicted by the ranking algorithm.
No. Reaction Step 1 Step 2
No. of
candidates
Ground
Truth
Rank
6 1 1 48,601 T 3
7 1 1 7,622 T 45
8 1 1 2,167 T 17
9 1 1 6,613 T 6
10 1 1 3,122 T 23
11 1 1 4,838 T 8
Figure 4: a. t-SNE projection of 6,613 candidates for two-step synthetic routes to target product in reaction 9, where denotes the ground-truth reaction route. In the left panel, the data points are color-coded according to the scores normalized to . In the right panel, the X-means clustering classified the 6,613 candidate routes into 98 groups, which are mapped on the t-SNE plot. The identified clusters are indicated in different colors. denotes the 10 candidate routes presented in b. b. 10 candidate routes belonging to different clusters. A score is assigned to each reaction step, indicating ‘feasible,’ ‘contentious’ or ‘infeasible,’ as judged by expert chemists.
Figure 5: a. Two-step reaction 3 in test dataset. The score denotes chemists’ judgments on the synthesis feasibility. b. Two alternative synthetic routes proposed by Bayesian retrosynthesis algorithm. c. The top reaction is a ring-forming synthetic route proposed by Bayesian retrosynthesis algorithm. The first step is infeasible because the orthoformic acid monoester is chemically unstable. The bottom reaction is a ring-forming synthetic route suggested by chemists based on the proposed synthetic route indicated at top.

4 Conclusions

We developed a Bayesian retrosynthesis algorithm. Most previous studies have focused on the direct prediction of the reaction inputs backwardly from a target product. In general, the backward prediction task is significantly more difficult than that of forward prediction, as the model needs to reconstruct several building blocks of reactant molecules that are generally missing in the target product. Moreover, most reactants resulting from such backward prediction will not be available for purchase and the candidate itself becomes a synthetic target. To overcome these obstacles, we firstly overhauled the problem of retrosynthetic prediction. We obtained a forward reaction model to define the mapping from reactants to products, achieving a high level of accuracy. Thereafter, the retrosynthetic prediction was addressed by exploring the inverse mapping from a target product to a pair of reactants in the given forward model, in which all possible pairs of purchasable compounds spanned the feasible solution space. The prediction accuracy on benchmark datasets outperformed the current state-of-the-art methods, as 80.3% and 50% of the known test reactions were successfully rediscovered within top-10 accuracy for the single-step and two-step synthetic route planning, respectively.

The Bayesian retrosynthesis algorithm revealed the presence of numerous alternative routes towards the target product, which were programmed in the trained reaction prediction model. The identification of such diverse candidate routes may be helpful for chemists to facilitate their creative works in synthetic organic chemistry. However, our expert chemists concluded that nearly 60% of the proposed two-step reactions would be false discoveries owing to no or exceedingly low reactivity. The prediction of the presence or absence of reactivity is currently beyond the capabilities of any synthetic prediction models, because these are trained only on instances from highly reactive reactions in the published data. The lack of negative data on failed reactions or low-yielding reactions prevents us from obtaining machine-learning models to discriminate the presence or absence of reactivity in a candidate synthetic route. Several previous studies generated artificial negative examples by perturbing and shuffling the reported known reactions. In this study, we introduced a heuristic rule for the ranking and prioritization of candidate reaction routes. However, none of these methods can solve the problem at a fundamental level, and eventually, we have to create a comprehensive dataset of negative reactions from experimental observations in laboratory synthesis, the literature, chemists’ hand-coded heuristics or high-throughput quantum chemistry calculations.

5 Acknowledgement

This work was supported in part by the Materials Research by Information Integration Initiative (MII) of the Support Program for Starting Up Innovation Hub from the Japan Science and Technology Agency (JST). Ryo Yoshida acknowledges the financial support received from a Grant-in-Aid for Scientific Research (A) 19H01132 from the Japan Society for the Promotion of Science (JSPS), JST CREST Grant Number JPMJCR19I1, Japan, and JSPS KAKENHI Grant Number 19H05820. Stephen Wu acknowledges the financial support received from JSPS KAKENHI Grant Number JP18K18017.

  • Figure S1: Detected candidates on two-step reaction routes to target products in reactions 1, …, 11.

  • Table S1: Reaction classes used to label 50k reactions and numbers of reactions in each class.

  • Table S2: Top-10 accuracy of various retrosynthetic prediction methods when reaction class labels are given.

  • Tables S3, S4 and S5: Parameters and experimental conditions for simple SMC, surrogate-assisted SMC (one-step) and surrogate-assisted SMC (multi-step).

  • Table S6: SMILES and detailed results of 40 reactions for performed test of simple SMC.

  • Table S7: SMILES and detailed results of 100 reactions used in one-step retrosynthesis.

  • Table S8: SMILES of 11 reactions used in multi-step retrosynthesis.

  • Table S9: Detailed results of 11 reactions used in multi-step retrosynthesis.

References