Value-Added Chemical Discovery Using Reinforcement Learning

Computer-assisted synthesis planning aims to help chemists find better reaction pathways faster. Finding viable, short pathways from sugar molecules to value-added chemicals can be modeled as a retrosynthesis planning problem with a catalyst allowed, a crucial step in efficient biomass conversion. The traditional computational chemistry approach to identifying possible reaction pathways involves computing the reaction energies of hundreds of intermediates, a critical bottleneck in in silico reaction discovery. Deep reinforcement learning has shown in other domains that a well-trained agent with little or no prior human knowledge can surpass human performance. While some effort has been made to adapt machine learning techniques to the retrosynthesis planning problem, value-added chemical discovery presents unique challenges. Specifically, a reaction can occur at several different sites in a molecule, a subtlety that previous works have not treated. With a more versatile formulation of the problem as a Markov decision process, we address the problem using deep reinforcement learning techniques and present promising preliminary results.




1 Introduction

Chemical transformation is the basis of every aspect of industrial processes, including the production of drugs, chemicals, and transportation fuels. Artificial intelligence—in particular, machine learning (ML)—and improved materials understanding present a unique opportunity to provide design rules for utilizing easily accessible carbon reserves by transforming them to value-added chemicals. In order to enable and maximize these chemical transformations, a detailed understanding of the mechanistic steps and knowledge of the shortest viable discovery pathways are essential. Existing discovery approaches, however, are either manually driven or based on trial and error. Automatically discovering transformation pathways by using ML has the potential to revolutionize and accelerate the discovery of chemicals and novel reaction pathways.

Through various chemical transformations, the carbon, oxygen, and hydrogen atoms of biomass can be utilized to form useful candidates. In this regard, automated data-driven adaptive algorithms can play a crucial part in optimizing the desired pathways for the production of novel compounds or identifying viable and cost-effective synthetic routes. To demonstrate this, we have chosen the example of aqueous acid-catalyzed conversion of a fructose molecule to a value-added compound, hydroxymethylfurfural (HMF). This transformation is equivalent to three consecutive dehydration reactions (removal of a water molecule) from fructose (shown in Figure 1(a)). We have developed an automatic reaction pathway generator (in Python) based on chemistry rules. This code utilizes RDKit Landrum and others (2006), an open-source cheminformatics software kit that includes an implementation of chemical reactions based on the SMILES arbitrary target specification (SMARTS). We have postulated rules for the reactions associated with carbohydrate chemistry: (a) protonation/deprotonation, (b) dehydration, (c) hydride shift, (d) ring opening (C–O bond cleavage) upon protonation, (e) ring closure (C–O bond formation) upon protonation, (f) ring contraction/expansion (5-, 6-, and 7-membered rings), (g) keto-enol transformation, (h) addition of water to a keto group to form diols, and (i) formation of formic acid from terminal diols.
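As an illustration (not the exact templates used in our generator), a single rule can be encoded as a reaction SMARTS and applied with RDKit; here a hypothetical protonation template is run against fructose:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical protonation rule: any hydroxyl oxygen gains a proton.
# (Our actual SMARTS templates are more detailed; this is a sketch.)
protonation = AllChem.ReactionFromSmarts("[OX2H1:1]>>[OH2+:1]")

# Fructose, pyranose form, stereochemistry omitted for simplicity.
fructose = Chem.MolFromSmiles("OCC1(O)OCC(O)C(O)C1O")

# RunReactants returns one product set per matching site, so a molecule
# with several hydroxyl groups yields several distinct offspring.
offspring = set()
for products in protonation.RunReactants((fructose,)):
    mol = products[0]
    Chem.SanitizeMol(mol)
    offspring.add(Chem.MolToSmiles(mol))

print(len(offspring))  # one protonated product per unique hydroxyl site
```

This site-level enumeration is exactly what makes the action space larger than the number of templates, a point we return to in Section 2.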

Notable recent ML approaches for molecular structure design with sequential chemical transformation stem from the work of Segler et al. Segler et al. (2018) and Coley et al. [1], where a reaction-template-based Monte Carlo tree search approach and a graph convolutional neural-network-based supervised learning approach are adopted, respectively. Schreck et al. Schreck et al. (2019) recently used deep reinforcement learning to determine optimal reaction paths, an approach that has great potential for the synthesis of unfamiliar molecules. However, all previous works implicitly assume that only one reaction center is possible within a molecule given a particular action template, or they never explicitly state how they handle the case of multiple reaction centers. Furthermore, both Segler et al. (2018) and Schreck et al. (2019) mentioned that the quality of the reaction template, or the choice of template, is one of the major reasons for their success. Unfortunately, a molecule with several reaction centers given a single reaction template in the sense of Segler et al. (2018), Schreck et al. (2019), and [1] is ubiquitous in our scenario, indicating that the crux of our problem is fundamentally different from what the previous works were able to address. Even with computationally heavy quantum chemistry methods, one obtains only an approximate estimate of which reaction center has the best chance of reacting. This situation motivates us to construct a more versatile formulation of the problem as a Markov decision process, in the hope that the agent will implicitly learn the underlying probability distribution of the reaction centers through self-play.

(a) Acid-catalyzed aqueous chemistry: fructose to HMF
(b) Protonating fructose at different reaction centers leads to distinct offspring
Figure 1: Illustration of chemical reactions on the fructose molecule.

2 Reinforcement learning for chemical synthesis

We formulate the chemical synthesis problem as a Markov decision process (MDP) Sutton and Barto (1998) to make it amenable to reinforcement learning techniques. An MDP is a tuple (S, A, T, R), where S denotes the state space, A the action space, T the transition model, and R the reward function. In our study, a state is a set of molecules, and an action is one of the reactions (a)–(i) introduced in the preceding section. We chose to have the action space vary with the state because although only one SMARTS template represents each type of reaction, the actual reaction can happen at any site that abides by the chemistry rules. For example, a fructose molecule has six distinct sites (hydroxyl groups) where protonation can happen, with different probabilities determined by the molecular structure and thermodynamic properties, which are shown in Assary et al. (2012) to be important factors. Previous works, including those of Segler et al. Segler et al. (2018) and Schreck et al. Schreck et al. (2019), classify actions only up to the reaction template. The major concern in both papers was choosing the most likely reaction template at each state. While such a simplification is appropriate for most chemical reactions, capturing the underlying probability distribution over templates is simply not enough when several possible sites are present, as is crucial in our scenario. As a concrete example, a fructose molecule and two distinct reactive offspring from protonation are shown in Figure 1(b). Both Segler et al. (2018) and Schreck et al. (2019) would represent these as a single action, whereas our formulation distinguishes the reaction centers.

The agent interacts with this environment by choosing a sequence of actions starting from the initial state and receives a positive reward if the goal state is reached within the maximum number of steps allowed; otherwise a negative reward penalizes the choices made. The goal of the agent is then to learn an optimal policy that maximizes the reward.

3 Experiments

Our approach starts by reading the SMILES string of the parent (fructose) and applying the protonation rule, which is equivalent to the first step of the acid-catalyzed reaction. For example, fructose has six oxygen atoms; therefore, fructose has six unique reaction centers, and protonation results in the formation of six reactive offspring. Starting from three initial reactants (fructose, water, and a proton), all reaction rules are applied to each reactant one by one. As product species are generated, they are added to the current reactant pool if they are not already in it. The process propagates until no new product can be formed and the reactant list cannot be updated any further. If we account only for the products with oxidation state +1 or less (neutral), 2,500 reactions can be generated from the initial three reactants. The initial and goal states have been tailored to this data set: we use fructose as the initial state and HMF as the goal state. We note that most of the reactions in the generated data set are reversible. By reversing the actions, the characterization of the data set changes, allowing us to test the agent's ability to generalize. After reversing the reactions, we run an experiment with HMF as the initial state and fructose as the goal state.
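The enumeration described above can be sketched as a fixed-point iteration over the reactant pool (our simplified illustration; `apply_rules` stands in for the actual RDKit-based rule engine):

```python
def enumerate_network(initial_reactants, apply_rules):
    """Grow the reaction network until no new species appear.

    `apply_rules(species, pool)` is assumed to return the set of product
    species obtainable by applying every reaction rule to `species`
    (possibly with co-reactants drawn from `pool`).
    """
    pool = set(initial_reactants)
    frontier = set(initial_reactants)
    while frontier:
        new_species = set()
        for species in frontier:
            for product in apply_rules(species, pool):
                if product not in pool:
                    new_species.add(product)
        pool |= new_species
        frontier = new_species  # propagate until a fixed point is reached
    return pool

# Toy rule engine: strings stand in for SMILES; "dehydration" removes one "W".
def toy_rules(species, pool):
    return {species.replace("W", "", 1)} if "W" in species else set()

# A parent with three removable waters, mimicking fructose -> HMF.
print(sorted(enumerate_network({"FWWW"}, toy_rules)))
```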

Both the original data set and its reversed variant are manually generated. They are meant to simulate the environment on a smaller scale for validation before the full-scale study. Eventually, the available actions will be generated algorithmically, by determining the possible reaction sites of the current molecule on the fly.

The molecules are represented by a Morgan fingerprint folded to 1,000 dimensions, prepared by using RDKit Landrum and others (2006); according to the RDKit online documentation, the Morgan fingerprint is similar to ECFP4 in most cases. To set up the experiment, we implement an OpenAI Gym Brockman et al. (2016) environment, and we train the policy network with the Proximal Policy Optimization (PPO) algorithm Schulman et al. (2017). The policy network is modeled by a 128-unit LSTM network Hochreiter and Schmidhuber (1997). We believe the choice of policy network and training algorithm is appropriate for the following reasons. If we treat our environment as a graph, with states as nodes and actions as edges, then pairs of reactions, for example protonation/deprotonation, can create loops, and even longer loops are feasible. We would like the agent's policy network to remember the actions it has taken before so that, by receiving negative rewards, the agent learns to avoid such loops. As for using PPO to update the policy, we note that rewards are given only at the end of each trajectory.
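A minimal environment skeleton following the classic OpenAI Gym interface is shown below. This is our simplified sketch, not the full implementation: transitions are looked up in a precomputed reaction table, states are plain strings rather than Morgan fingerprints, and the reward values are illustrative.

```python
class SynthesisEnv:
    """Gym-style environment over a precomputed reaction network."""

    def __init__(self, reactions, start, goal, max_steps=10):
        self.reactions = reactions      # {state: [(action_name, next_state), ...]}
        self.start, self.goal = start, goal
        self.max_steps = max_steps

    def reset(self):
        self.state, self.steps = self.start, 0
        return self.state

    def step(self, action_index):
        self.steps += 1
        _, self.state = self.reactions[self.state][action_index]
        if self.state == self.goal:                # (1) goal reached: positive reward
            return self.state, 1.0, True, {}
        if not self.reactions.get(self.state):     # (2) dead end: negative reward
            return self.state, -1.0, True, {}
        if self.steps >= self.max_steps:           # (3) out of steps: negative reward
            return self.state, -1.0, True, {}
        return self.state, 0.0, False, {}          # reward only at trajectory end

# Tiny two-step network: parent -> intermediate -> goal.
table = {"A": [("protonate-site-0", "B")], "B": [("dehydrate", "C")]}
env = SynthesisEnv(table, start="A", goal="C")
state = env.reset()
state, reward, done, _ = env.step(0)   # A -> B, episode continues
state, reward, done, _ = env.step(0)   # B -> C, goal reached
print(state, reward, done)
```

Because intermediate steps return zero reward, the sparse-reward structure discussed next arises naturally from this design.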

PPO is known to work well on sparse-reward problems OpenAI. Compared with other policy gradient algorithms such as TRPO Schulman et al. (2015) or DDPG Lillicrap et al. (2015), it is easily scalable, more robust, and needs little hyperparameter tuning Schulman et al. (2017).

A trajectory is defined as an attempted path from the initial state to a goal state. Rewards are received only once in each trajectory. Because of the nature of chemical reactions, it is impractical to consider synthesis paths longer than a fixed number of steps; let N denote this maximum number of steps. A trajectory has three possible outcomes: (1) the agent reaches the goal state within N steps and receives a positive reward; (2) the agent reaches a dead-end state, where no more reactions can happen, and receives a negative reward; or (3) the agent reaches neither the goal state nor a dead-end state within N steps and receives a negative reward.

4 Results

(a) fructose to HMF
(b) HMF to fructose
Figure 2: Length of shortest paths with different start states.

In Figures 2(a) and 2(b), the number of steps in the shortest paths our agent found is plotted against the number of trajectories. The initial impression is that the agent does gradually learn the shortest paths, but the learning experience varies depending on the start state. The forward direction (fructose to HMF) converges more slowly, possibly because many states offer more choices of action. During our experiments we found that the performance of the agent is consistent every time we retrain the policy network; therefore no averaging over runs is performed.

Figure 3: One shortest fructose-HMF reaction sequence identified by reinforcement learning.

Our experiment, although simple, has demonstrated great potential. Data exploration shows that the test data set and its reverse have varied characteristics in terms of the maximum number of actions available across all states and the number of dead-end states. The agent nonetheless performed well in both cases. Moreover, the agent has little knowledge of the underlying chemistry other than that all molecules are represented by Morgan fingerprints. Not only is the human factor removed from the discovery process, but the training overhead found in Segler et al. (2018) is also avoided. This simplicity does not mean that performance is compromised: in fact, the shortest path shown in Figure 3 is identical to one of the shortest paths identified by chemists in Assary et al. (2012). Computing the full thermodynamic landscape as in Assary et al. (2012) is in general difficult; the agent's demonstrated ability to learn through self-play is therefore extremely helpful in our scenario.

5 Discussion

This work is only the beginning of an exciting project, and we point out some future directions that are worth exploring. Our next step is to assess how this approach generalizes to other initial/goal state configurations as well as to larger data sets. We are developing methods to compute the possible actions at each state instead of generating them manually, which would become infeasible with many different start states. The molecular structure largely determines the likelihood of the site at which a particular reaction will happen; by converting to Morgan fingerprints, however, this structural information is partially lost. We therefore plan to change the representation of molecules to graphs so that the agent can learn more directly from the structure. Building on this idea, we hypothesize that a network pretrained to predict reaction sites, ideally working directly with the graph representation, will help the agent learn faster on a larger data set.


This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. This work was conducted in part by the Computational Chemistry Physics Consortium (CCPC), which is supported by the Bioenergy Technologies Office (BETO) of Energy Efficiency & Renewable Energy (EERE). P. Jiang was funded by NSF through the MSGI program during her time at Argonne National Laboratory.


  • [1] C. W. Coley et al. Cited by: §1, §2.
  • [2] R. Assary, T. Kim, J. Low, J. Greeley, and L. A. Curtiss (2012) Glucose and fructose to platform chemicals: understanding the thermodynamic landscapes of acid-catalysed reactions using high-level ab initio methods. Physical Chemistry Chemical Physics 14. External Links: Document Cited by: §2, §4.
  • [3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. External Links: arXiv:1606.01540 Cited by: §3.
  • [4] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: Document, Link Cited by: §3.
  • [5] G. Landrum et al. (2006) RDKit: open-source cheminformatics. Cited by: §1, §3.
  • [6] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. CoRR abs/1509.02971. Cited by: §3.
  • [7] OpenAI. OpenAI Five. Cited by: §3.
  • [8] J. S. Schreck, C. W. Coley, and K. J. M. Bishop (2019) Learning retrosynthetic planning through simulated experience. ACS Central Science 5 (6), pp. 970–981. External Links: Document, Link Cited by: §1, §2.
  • [9] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 1889–1897. External Links: Link Cited by: §3.
  • [10] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. ArXiv abs/1707.06347. Cited by: §3.
  • [11] M. H. S. Segler, M. Preuss, and M. P. Waller (2018) Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, pp. 604–610. External Links: Link Cited by: §1, §2, §4.
  • [12] R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §2.