1 Introduction
Consider environments represented by a Markov Decision Process with discrete states $s \in \mathcal{S}$ and actions $a \in \mathcal{A}$. Deep Reinforcement Learning has been successful in tasks with moderate state and action space sizes where cheap and reliable environment simulators are available, or where the agent can even be placed in the real environment, as in many games. Agents can then be trained by interacting many times with the (simulated) environment. Unfortunately, in many important tasks of real-life interest, this framework cannot be easily applied. First, interactions with the real environment can be very expensive, and even potentially dangerous: it would be prohibitive to let an agent initially drive randomly on public roads, or to perform unrestricted scientific wet-lab experiments, which can cost thousands of pounds (Dulac-Arnold et al., 2019).
It has been recognised early on that it is possible to train models as simulators of the environment dynamics based on limited interaction data, for example recorded trajectories or experiments which have been performed in the past (Schmidhuber, 1990; Ha & Schmidhuber, 2018). However, canonical model-based RL still requires that the action space is known and can be provided easily by the programmer.
More generally, the learner may not have information available about the possible actions it can take, since the action space $\mathcal{A}$ can be too complex to specify explicitly, similar to common-sense knowledge in expert systems. The learner then first has to learn the nature of the elements of the set $\mathcal{A}$ such that they are consistent with observations of subsequent states.
Second, in contrast to board games or image representations which are of fixed size, in a more general setting the state space is compositional: constructed of multiple entities and their relations, it can potentially have countably infinite cardinality, which requires the learner to exhibit an appropriate inductive bias. Agents with representations based on graphs can naturally tackle compositionality in environments, and have been successful in chemical (Segler et al., 2017, 2018) and physical (Hamrick et al., 2018; Bapst et al., 2019) reasoning and planning tasks.
In the frontiers section of their book, Sutton & Barto (2018) succinctly state that "we still need scalable methods for planning with learned environment models. Planning methods have proven extremely effective in applications such as AlphaGo Zero and computer chess in which the model of the environment is known from the rules of the game or can otherwise be supplied by human designers. But cases of full model-based reinforcement learning, in which the environment model is learned from data and then used for planning, are rare."
Here, we describe a general formalism to address this issue for learning and planning in compositional state spaces with a priori unknown actions, as a merger of three components: 1) the induction of the action space as a graph-based world program from a small set of state–state transitions, 2) the learning of neural network policies and dynamics models, and 3) planning using these networks and the world program. We further highlight its successful application, and future challenges for the RL community.
2 Formalism
Consider the problem of near-optimal decision making in infinite, discrete, compositional state and action spaces, where a simulator of the environment dynamics is not available. Furthermore, consider that the learner does not know the elements of the action space a priori. The only clue is a limited amount of batch data in the form of state–state pairs $(s, s')$ from (previous) observations of the environment. In the framework of Markov Decision Processes (Sutton & Barto, 2018) we then have
$\tilde{\mathcal{A}}$   actions   (1)
$\tilde{\mathcal{A}}(s) \subseteq \tilde{\mathcal{A}}$   available actions in state $s$   (2)
$\tilde{\pi}(a \mid s)$   policy   (3)
$\tilde{T}(s' \mid s, a)$   dynamics   (4)
$R(s)$   reward   (5)
where the tilde $\tilde{\cdot}$ indicates that this set or mapping has to be learned. For the sake of simplicity, we assume $R$ is given.[1]

[1] One could consider learning the reward function via inverse RL (Ng et al., 2000).
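To make Eqs. (1)–(5) concrete, the learned quantities can be collected in a single container. The following sketch is purely illustrative: the class and field names are hypothetical, and the integer-state instantiation is a toy, not the chemistry application discussed later.

```python
from dataclasses import dataclass
from typing import Callable, Set

# Hypothetical container for the quantities in Eqs. (1)-(5).
# Everything except `reward` must be learned from observed (s, s') pairs.
@dataclass
class WorldModel:
    actions: Set            # the induced action set (Eq. 1)
    available: Callable     # s -> subset of actions available in s (Eq. 2)
    policy: Callable        # s -> distribution over actions (Eq. 3)
    dynamics: Callable      # (s, a) -> next state (Eq. 4)
    reward: Callable        # s -> float, assumed given (Eq. 5)

# A trivial instantiation on integer states, for illustration only:
wm = WorldModel(
    actions={+1, -1},
    available=lambda s: {+1, -1} if s > 0 else {+1},
    policy=lambda s: {a: 0.5 for a in (+1, -1)},
    dynamics=lambda s, a: s + a,
    reward=lambda s: 1.0 if s == 0 else 0.0,
)
```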
2.1 Graph-Based Environments
We consider compositional environments, where states are composed of multisets of graphs.[2] Graphs in turn are composed of vertices $V$ (nodes) and edges $E$, which model the relationships between the vertices. Graphs may be vertex- or edge-labeled with arbitrarily structured labels, and can be directed. This subsumes, for example, grids (lattice graphs, images), text (linear chain graphs), computer code (abstract syntax trees), molecules, and physical systems. Let $\mathcal{G}$ be the set of all possible graphs $G$, which are defined by an unrestricted graph grammar (Rozenberg, 1997). The state space is $\mathcal{S} = \mathcal{P}_m(\mathcal{G})$, where $\mathcal{P}_m$ denotes all submultisets formed by the elements of $\mathcal{G}$.[3]

[2] With some loss of generality, these sets of graphs may be merged into a single graph with multiple disconnected components. However, this makes it more challenging to treat different groups of disconnected components.
[3] This extends power sets of sets to multisets.
Actions, which transform states into new states, can then be seen as graph rewriting rules $r: G_1 \to G_2$, which means a graph $G_1$ is matched in the graphs of a state via (sub)graph isomorphism, cut out, and a different graph $G_2$ is glued in at this position (Rozenberg, 1997; Andersen et al., 2013). This gives the learner a handle to induce the elements of the action space.
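As a toy illustration of actions as rewriting rules, consider the degenerate case of linear chain graphs encoded as strings, where matching $G_1$ and gluing in $G_2$ reduces to substring replacement. The helper names are hypothetical and not from any published implementation.

```python
# A rule r = (g1, g2) matches subgraph g1 and replaces it with g2.
# Strings serve as linear chain graphs, a special case noted in the text.

def applicable_actions(state, rules):
    """Return the rules whose left-hand side matches somewhere in `state`."""
    return [r for r in rules if r[0] in state]

def apply_rule(state, rule):
    """Cut out the matched subgraph and glue in the replacement."""
    g1, g2 = rule
    assert g1 in state, "rule does not match this state"
    return state.replace(g1, g2, 1)  # rewrite the first match only

rules = [("AB", "C"), ("CC", "D")]
s = apply_rule("ABAB", ("AB", "C"))  # "CAB"
s = apply_rule(s, ("AB", "C"))       # "CC"
s = apply_rule(s, ("CC", "D"))       # "D"
```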
2.2 World Programs
Program induction is the task of generating a program consistent with a specification provided by input–output examples (Gulwani et al., 2017). A plethora of algorithms has been proposed for this task. Purely symbolic approaches are very data-efficient and can learn from very few examples; however, they tend to struggle in the face of noisy data. In contrast, approaches based on low-bias neural networks such as seq2seq RNNs or transformers turn out to be quite robust to noise. However, without adequate inductive biases they require large amounts of data to train, and do not generalise well beyond the training distribution. Several models combining neural and symbolic components have also been proposed (Gaunt et al., 2017; Bošnjak et al., 2017; Evans & Grefenstette, 2018; Bunel et al., 2018).
In our formalism, we define the triple $(\tilde{\mathcal{A}}, \tilde{\mathcal{A}}(s), \tilde{T})$ as the world program, which the learner needs to learn from observations in the form of $(s, s')$ pairs. We can then treat each graph-rewriting action $a$ as a subroutine, and $\tilde{\mathcal{A}}(s)$ as a map from a state to sets of subroutines. Furthermore, we can leverage the induced actions to learn the transition model $\tilde{T}$ by sampling actions $a \sim \tilde{\mathcal{A}}(s)$ and applying them to $s$ to generate training examples consistent and inconsistent with the observations.
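The example-generation step can be sketched as follows, using a string encoding of linear chain graphs for brevity: a rewrite that reproduces the observed next state yields a positive example, any other applicable rewrite a negative one. All names here are hypothetical, illustrative stand-ins.

```python
# Build labeled training data for the transition model from induced rules
# and observed (s, s') pairs. States are strings standing in for graphs.

def make_transition_examples(observed_pairs, rules):
    """Return tuples (s, action, candidate_next_state, label)."""
    examples = []
    for s, s_next in observed_pairs:
        for g1, g2 in rules:
            if g1 not in s:
                continue  # rule not applicable in this state
            candidate = s.replace(g1, g2, 1)
            # label 1 iff the rewrite reproduces the observed next state
            examples.append((s, (g1, g2), candidate, int(candidate == s_next)))
    return examples

data = make_transition_examples([("ABX", "CX")], [("AB", "C"), ("BX", "Y")])
```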
The concrete implementation of the program induction step is a hyperparameter. It can be realized by many different algorithms, ranging from purely symbolic approaches to graph neural networks or purely distributed representations (Segler & Waller, 2017a,b; Liu et al., 2017; Bradshaw et al., 2018; Yin et al., 2018; Bradshaw et al., 2019).

2.3 Transition Functions and Policies
Graph neural networks (GNNs) have been employed in chemistry for decades (Kireev, 1995; Baskin et al., 1997; Merkwirth & Lengauer, 2005), and can be traced back to algorithms for graph canonicalisation developed for chemical databases (Morgan, 1965; Weisfeiler & Lehman, 1968) and path-based graph featurisation (Wiener, 1947). GNNs have recently been shown to be useful in many other domains where relations between objects can be modelled as graphs, e.g. physics and source code (Niepert et al., 2016; Kipf & Welling, 2017; Brockschmidt et al., 2018; Bapst et al., 2019). Here, graph neural networks provide the adequate inductive bias to represent the states.
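The core inductive bias of a GNN, updating each node's state from its neighbours, can be sketched in a few lines. This is a bare-bones illustration with scalar node features and no learned weights; real models apply trained transformations to the messages.

```python
# One message-passing step on an adjacency-list graph:
# each node aggregates its neighbours' features into its own.

def message_passing_step(features, adjacency):
    """features: {node: float}; adjacency: {node: [neighbour nodes]}."""
    return {
        v: features[v] + sum(features[u] for u in adjacency[v])
        for v in features
    }

feats = {"a": 1.0, "b": 2.0, "c": 3.0}
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
out = message_passing_step(feats, adj)
# a: 1+2=3, b: 2+1+3=6, c: 3+2=5
```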
2.4 Planning with World Programs
After the learner has induced the world program, it can be employed as a simulator in combination with any reinforcement learning algorithm to train an agent or perform planning. The discrete nature of the problem and the availability of the model lend themselves immediately to Monte Carlo Tree Search (MCTS) (Kocsis & Szepesvári, 2006; Coulom, 2006) and its extensions.
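A minimal UCT-style search over a learned world program might look as follows, where `actions(s)` proposes induced rewrite rules, `step(s, a)` applies one, and `solved(s)` is the goal test. This is an illustrative skeleton under those assumed interfaces, not the PUCT implementation used in the experiments.

```python
import math
import random

def uct_search(root, actions, step, solved, n_iter=200, c=1.4, rng=random):
    """Return the most-visited action at the root after n_iter simulations."""
    N, W, children = {}, {}, {}  # visit counts, value sums, expanded nodes

    def rollout(s, depth=10):
        for _ in range(depth):
            if solved(s):
                return 1.0
            acts = actions(s)
            if not acts:
                return 0.0
            s = step(s, rng.choice(acts))
        return 1.0 if solved(s) else 0.0

    for _ in range(n_iter):
        s, path = root, []
        # selection: descend via UCB1 until an unexpanded or terminal node
        while s in children and children[s] and not solved(s):
            total = sum(N.get((s, a), 0) for a in children[s])
            def score(a, s=s):
                n = N.get((s, a), 0)
                q = W.get((s, a), 0.0) / n if n else 0.0
                return q + c * math.sqrt(math.log(total + 1) / (n + 1e-9))
            a = max(children[s], key=score)
            path.append((s, a))
            s = step(s, a)
        if s not in children:
            children[s] = actions(s)  # expansion
        v = rollout(s)                # simulation
        for (si, ai) in path:         # backpropagation
            N[(si, ai)] = N.get((si, ai), 0) + 1
            W[(si, ai)] = W.get((si, ai), 0.0) + v
    root_actions = children.get(root, [])
    return max(root_actions, key=lambda a: N.get((root, a), 0)) if root_actions else None

# toy usage: rewrite "ABAB" to the goal "D" with two string-rewrite rules
rules = [("AB", "C"), ("CC", "D")]
best = uct_search(
    "ABAB",
    actions=lambda s: [r for r in rules if r[0] in s],
    step=lambda s, a: s.replace(a[0], a[1], 1),
    solved=lambda s: s == "D",
    n_iter=50,
    rng=random.Random(0),
)
```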
3 Experiments
The formalism described above was applied to the problem of chemical synthesis planning for small organic molecules, which are of central importance for human wellbeing as medicines or materials (Todd, 2005). Here, starting with the target molecule we wish to make, the task is to perform a backward search by recursively deconstructing the molecule into increasingly simpler parts using formally reversed chemical reactions (graph rewriting rules); see Figure 2. The problem is solved iff the state is a subset of a predefined set of building blocks. More formally, given a set of building block graphs $\mathcal{B}$, a state $s$ is solved iff for every graph $G \in s$ there is a $B \in \mathcal{B}$ such that $G$ is isomorphic to $B$. The results are reproduced from Segler et al. (2018), where this framework was first applied, albeit formulated for a non-ML audience.
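The goal test can be sketched as follows. Here graph isomorphism is stood in for by comparing a placeholder canonical form (sorting the vertices of a chain graph); a real implementation would use a proper graph canonicalisation or isomorphism check, and all names are illustrative.

```python
# A state (multiset of graphs) is solved iff every graph in it is
# isomorphic to some building block.

def canonical(g):
    # placeholder canonicalisation: sort the chain-graph's vertex labels
    return "".join(sorted(g))

def is_solved(state, building_blocks):
    blocks = {canonical(b) for b in building_blocks}
    return all(canonical(g) in blocks for g in state)
```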
3.1 Methods
The graphs in this task are small organic molecules, consisting of atoms as vertices and bonds as edges. To encode these graphs, a special case of graph neural networks (GNNs) was used in which the trainable weights are replaced by hash functions, which drastically speeds up inference while performing well (Rogers & Hahn, 2010). To induce the world program, we first take chemical reactions reported in the literature as our $(s, s')$ observations, with the reagents and the product of the reactions as the states. We then employ a symbolic search procedure to delete all nodes and edges from the molecular graphs that do not get changed in the course of the reaction, to form the action set $\tilde{\mathcal{A}}$.[4] To learn $\tilde{\mathcal{A}}(s)$, we train a neural network to predict the probability over all rules given $s$, and restrict the available actions to the top-$k$ rules within a given cumulative probability. The transition function $\tilde{T}$ is a binary classifier which takes the same GNN as input for states, and takes the difference of the embeddings of $s'$ and $s$ as the representation of the action $a$. For more details, we refer the reader to Segler et al. (2018).

[4] This is 3 orders of magnitude larger than the game of Go.

3.2 Quantitative Results
Table 1 shows experimental results for several models in the task of predicting synthesis plans for 497 different molecules (Segler et al., 2018). In PUCT-MCTS, the prior probability from the network representing the policy is used to bias the exploration term. Monte Carlo Search (MCS) (sampling from the prior), UCT-MCTS, where the exploration term does not have a predicted probability contribution, and two Best First Search (BFS) variants all perform worse than PUCT-MCTS.[5]

[5] It has to be noted that we did not tune most of the hyperparameters of the PUCT-MCTS agent (i.e. the world program induction algorithm, the neural network architectures of the policy network and the transition function), except for the MCTS exploration constant. We therefore expect considerable gains from investigating these hyperparameters.

Agent            | % solved      | time (s/plan)
-----------------|---------------|--------------
PUCT-MCTS        | 95.24 ± 0.09  | 13.0
MCS              | 89.54 ± 0.59  | 275.7
UCT-MCTS         | 87.12 ± 0.29  | 30.4
BFS (neural)     | 84.24 ± 0.09  | 39.1
BFS (heuristic)  | 55.53 ± 2.02  | 422.1

Table 1: Time budget 3300 s and 100,000 iterations for MC(T)S, or 3300 s and 100,000 expansions for BFS, per molecule (restarts were allowed). Std. dev. is given for the solved ratio. Results reproduced from Segler et al. (2018).
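For reference, the exploration bias in PUCT-style selection takes the common form popularised by AlphaZero-style agents (Silver et al., 2018); the exact constants used in the experiments may differ, and $c_{\text{puct}}$ is a tunable hyperparameter:

```latex
a^* = \arg\max_a \left[ Q(s,a) + c_{\text{puct}} \, \tilde{\pi}(a \mid s)
      \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right]
```

Here $Q(s,a)$ is the mean simulation value and $N(s,a)$ the visit count of the edge; dropping the prior $\tilde{\pi}(a \mid s)$ recovers a UCT-style rule, matching the UCT-MCTS baseline above.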
3.3 Qualitative Results
A notorious problem with model-based RL and planning is that strong learners can exploit imperfections in the simulator, leading to plans that do not translate to the real environment. Thus, to test the quality of the plans, we conducted two A/B tests with 45 expert organic chemists, who had to choose one of two plans in a double-blind setup.
First, the subjects were presented with plans previously reported in the literature, and a plan generated by the PUCT-MCTS algorithm for the same target molecule. The null hypothesis is that experts rate the computed plans as inferior (expected preference: computer 0% vs literature 100%). Surprisingly, we found that the experts perceived the computed routes (57.0%) as equivalent to the literature routes (43.0%).
In the second test, the experts had to report their preferences for plans found by PUCT-MCTS or plans generated by a baseline system based on heuristic BFS. The subjects significantly preferred the routes generated by MCTS (68.2%) over the baseline system (31.8%).
4 Related topics
World program learning for discrete states and actions is related to the problem of inverse dynamics in continuous state and action spaces, e.g. for use in robotics, where the torques to produce a movement are induced from observed trajectories (Nguyen-Tuong & Peters, 2010; Meier et al., 2016; Christiano et al., 2016). How to extend the framework presented here to the continuous setting remains a topic for future work.
Furthermore, there is a relation between world program induction and hierarchical reinforcement learning/option discovery (Sutton & Barto, 2018), which is predominantly used to extract spatiotemporal abstractions on multiple scales by joining sequences of (micro-)actions into more abstract macro-actions, where both micro- and macro-actions have the same type or unit. In contrast, in a world program, the full action and its atomistic components do not have the same type.
5 Conclusion and Future Challenges
Here, we have proposed a formalism to perform model-based planning in complex compositional environments, where the learner a priori does not have knowledge of the action space. As graph neural networks have progressed from chemical modeling problems to all kinds of relational prediction problems, we hope that the machine learning community might likewise take inspiration from techniques developed for chemical planning and carry them over to other relational planning problems, such as automated theorem proving, code optimization/rewriting, or physical construction and deconstruction problems.
To conclude, we would also like to pose a small challenge: given only a small set of historically played Go and Chess games as grid graphs, can you first learn the rules of the games to build a simulator, and then compete with an AlphaZero-like agent (Silver et al., 2018) using an agent trained in the learned simulator?
Acknowledgements
The author thanks M. Ahmed, T. Edlich, and B. Fabian for feedback on the manuscript, and the anonymous reviewers for helpful suggestions.
References
 Andersen et al. (2013) Andersen, J. L., Flamm, C., Merkle, D., and Stadler, P. F. Generic strategies for chemical space exploration. arXiv preprint arXiv:1302.4006, 2013.
 Bapst et al. (2019) Bapst, V., Sanchez-Gonzalez, A., Doersch, C., Stachenfeld, K. L., Kohli, P., Battaglia, P. W., and Hamrick, J. B. Structured agents for physical construction. arXiv preprint arXiv:1904.03177, 2019.
 Baskin et al. (1997) Baskin, I. I., Palyulin, V. A., and Zefirov, N. S. A neural device for searching direct correlations between structures and properties of chemical compounds. Journal of chemical information and computer sciences, 37(4):715–721, 1997.
 Bošnjak et al. (2017) Bošnjak, M., Rocktäschel, T., Naradowsky, J., and Riedel, S. Programming with a differentiable forth interpreter. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 547–556. JMLR.org, 2017.
 Bradshaw et al. (2018) Bradshaw, J., Kusner, M. J., Paige, B., Segler, M. H., and Hernández-Lobato, J. M. A generative model for electron paths. arXiv preprint arXiv:1805.10970, 2018.
 Bradshaw et al. (2019) Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. H., and Hernández-Lobato, J. M. A model to search for synthesizable molecules. arXiv preprint arXiv:1906.05221, 2019.
 Brockschmidt et al. (2018) Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polozov, O. Generative code modeling with graphs. arXiv preprint arXiv:1805.08490, 2018.
 Bunel et al. (2018) Bunel, R., Hausknecht, M., Devlin, J., Singh, R., and Kohli, P. Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276, 2018.
 Christiano et al. (2016) Christiano, P., Shah, Z., Mordatch, I., Schneider, J., Blackwell, T., Tobin, J., Abbeel, P., and Zaremba, W. Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518, 2016.
 Coulom (2006) Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pp. 72–83. Springer, 2006.
 Dulac-Arnold et al. (2019) Dulac-Arnold, G., Mankowitz, D., and Hester, T. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.

 Evans & Grefenstette (2018) Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
 Gaunt et al. (2017) Gaunt, A. L., Brockschmidt, M., Kushman, N., and Tarlow, D. Differentiable programs with neural libraries. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1213–1222. JMLR.org, 2017.
 Gulwani et al. (2017) Gulwani, S., Polozov, O., and Singh, R. Program synthesis. Foundations and Trends in Programming Languages, 4(1–2):1–119, 2017.
 Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2450–2462, 2018.
 Hamrick et al. (2018) Hamrick, J. B., Allen, K. R., Bapst, V., Zhu, T., McKee, K. R., Tenenbaum, J. B., and Battaglia, P. W. Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203, 2018.
 Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.
 Kireev (1995) Kireev, D. B. Chemnet: A novel neural network based method for graph/property mapping. Journal of chemical information and computer sciences, 35(2):175–180, 1995.
 Kocsis & Szepesvári (2006) Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In European conference on machine learning, pp. 282–293. Springer, 2006.
 Liu et al. (2017) Liu, B., Ramsundar, B., Kawthekar, P., Shi, J., Gomes, J., Luu Nguyen, Q., Ho, S., Sloane, J., Wender, P., and Pande, V. Retrosynthetic reaction prediction using neural sequencetosequence models. ACS Cent. Sci., 3(10):1103–1113, 2017.
 Meier et al. (2016) Meier, F., Kappler, D., Ratliff, N., and Schaal, S. Towards robust online inverse dynamics learning. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4034–4039. IEEE, 2016.
 Merkwirth & Lengauer (2005) Merkwirth, C. and Lengauer, T. Automatic generation of complementary descriptors with molecular graph networks. Journal of chemical information and modeling, 45(5):1159–1168, 2005.
 Morgan (1965) Morgan, H. The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service. Journal of Chemical Documentation, 5(2):107–113, 1965.
 Ng et al. (2000) Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In ICML, 2000.
 Nguyen-Tuong & Peters (2010) Nguyen-Tuong, D. and Peters, J. Using model knowledge for learning inverse dynamics. In 2010 IEEE International Conference on Robotics and Automation, pp. 2677–2682. IEEE, 2010.

 Niepert et al. (2016) Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023, 2016.
 Rogers & Hahn (2010) Rogers, D. and Hahn, M. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
 Rozenberg (1997) Rozenberg, G. Handbook of Graph Grammars and Computing by Graph Transformation, volume 1. World Scientific, 1997.

 Schmidhuber (1990) Schmidhuber, J. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technische Universität München, Tech. Report FKI-126-90, 1990.
 Segler et al. (2017) Segler, M., Preuß, M., and Waller, M. P. Towards "AlphaChem": Chemical synthesis planning with tree search and deep neural network policies. ICLR Workshop track, preprint arXiv:1702.00020, 2017.
 Segler & Waller (2017a) Segler, M. H. and Waller, M. P. Modelling Chemical Reasoning to Predict and Invent Reactions. Chem. Eur. J., 23(25):6118–6128, 2017a. ISSN 1521-3765. doi: 10.1002/chem.201604556.
 Segler & Waller (2017b) Segler, M. H. and Waller, M. P. Neuralsymbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J., 23(25):5966–5971, 2017b.
 Segler et al. (2018) Segler, M. H., Preuss, M., and Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
 Silver et al. (2018) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
 Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 2nd edition, 2018.
 Todd (2005) Todd, M. H. Computeraided organic synthesis. Chemical Society Reviews, 34(3):247–266, 2005.
 Weisfeiler & Lehman (1968) Weisfeiler, B. and Lehman, A. A. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
 Wiener (1947) Wiener, H. Structural determination of paraffin boiling points. Journal of the American Chemical Society, 69(1):17–20, 1947.
 Yin et al. (2018) Yin, P., Neubig, G., Allamanis, M., Brockschmidt, M., and Gaunt, A. L. Learning to represent edits. arXiv preprint arXiv:1810.13337, 2018.