Program Synthesis Guided Reinforcement Learning

A key challenge for reinforcement learning is solving long-horizon planning and control problems. Recent work has proposed leveraging programs to help guide the learning algorithm in these settings. However, these approaches impose a high manual burden on the user since they must provide a guiding program for every new task they seek to achieve. We propose an approach that leverages program synthesis to automatically generate the guiding program. A key challenge is how to handle partially observable environments. We propose model predictive program synthesis, which trains a generative model to predict the unobserved portions of the world, and then synthesizes a program based on samples from this model in a way that is robust to its uncertainty. We evaluate our approach on a set of challenging benchmarks, including a 2D Minecraft-inspired “craft” environment where the agent must perform a complex sequence of subtasks to achieve its goal, a box-world environment that requires abstract reasoning, and a variant of the craft environment where the agent is a MuJoCo Ant. Our approach significantly outperforms several baselines, and performs essentially as well as an oracle that is given an effective program.



1 Introduction

Reinforcement learning has been applied to solving challenging planning and control problems (Mnih et al., 2015; Arulkumaran et al., 2017). Despite a significant amount of recent progress, solving long-horizon problems remains a significant challenge due to the combinatorial explosion of possible strategies.

One promising approach to addressing these issues is to leverage programs to guide the behavior of the agents (Andreas et al., 2017; Sun et al., 2020). In this paradigm, the user provides a sequence of high-level instructions designed to guide the agent. For instance, the program might encode intermediate subgoals that the agent should aim to achieve, but leave the reinforcement learning algorithm to discover how exactly to achieve these subgoals. In addition, to handle partially observable environments, these programs might encode conditionals that determine the course of action based on the agent’s observations.

The primary drawback of these approaches is that the user becomes burdened with providing such a program for every new task. Not only is this process time-consuming for the user, but a poorly written program may hamper learning. A natural question is whether we can automatically synthesize these programs. That is, rather than require the user to provide the program, we instead have them provide a high-level specification that encodes only the desired goal. Then, our framework automatically synthesizes a program that achieves this specification. Finally, this program is used to guide the reinforcement learning algorithm.

Figure 1: (a) An initial state for the craft environment. Bright regions are observed and dark ones are unobserved. This map has two zones separated by a stone boundary (blue line). The first zone contains the agent, 2 irons, and 1 wood; the second contains 1 iron and 1 gem. The goal is to get the gem. The agent represents the high-level structure of the map (e.g., the resources in each zone) as abstraction variables. The ground-truth abstraction variables are shown in the top-right; we only show the counts of gems, irons, and woods in each zone and the zone containing the agent. The two thought bubbles below are abstraction variables hallucinated by the agent based on the observed parts of the map. In both, the zone the agent is in contains a gem, so the synthesized program is “get gem”. However, this program cannot achieve the goal. (b) The state after the agent has taken 20 actions, failed to obtain the gem, and is now synthesizing a new program. It has explored more of the map, so the hallucinations are more accurate, and the new program is a valid strategy for obtaining the gem.

The key challenge to realizing our approach is how to handle partially observable environments. In the fully observed setting, the program synthesis problem reduces to STRIPS planning (Fikes and Nilsson, 1971)—i.e., search over the space of possible plans to find one that achieves the goal. However, these techniques are hard to apply in settings where the environment is initially unknown.

To address this challenge, we propose an approach called model predictive program synthesis (MPPS). At a high level, our approach synthesizes the guiding program based on a conditional generative model of the environment, but in a way that is robust to the uncertainty in this model. In particular, for a user-provided goal specification φ*, the agent chooses its actions using the following three steps:


  • Hallucinator: First, inspired by world models (Ha and Schmidhuber, 2018), the agent keeps track of a conditional generative model H over possible realizations of the unobserved portions of the environment.

  • Synthesizer: Next, given the worlds predicted by H, the agent synthesizes a program ρ that achieves φ* assuming these predictions are accurate. Since the world predictions are stochastic in nature, it samples multiple predicted worlds and computes the program that maximizes the probability of success according to these samples.

  • Executor: Finally, the agent executes the strategy encoded by ρ for a fixed number of steps N. Concretely, ρ is a sequence of components c_1, …, c_k, where each component c_i is an option (π_i, β_i) (Sutton et al., 1999), which says to execute policy π_i until condition β_i holds.

If φ* is not satisfied after N steps, then the above process is repeated. Since the hallucinator now has more information (assuming the agent has explored more of the environment), the agent now has a better chance of achieving its goal. Importantly, the agent is implicitly encouraged to explore, since it must do so to discover whether the current program ρ can successfully achieve the goal φ*.
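The loop above can be sketched in code. The following is a toy illustration of the control flow only: the one-dimensional "map", the random-guess hallucinator, and the index-valued "programs" are our own inventions, not the paper's implementation.

```python
import random

# Toy sketch of the MPPS loop (hallucinate -> synthesize -> execute) on a
# one-dimensional "map" with hidden cells. Only the control flow mirrors
# the approach; the domain and every name here are our own invention.

HIDDEN_MAP = ["wood", "wood", "gem", "wood"]  # ground truth, unknown to agent

def hallucinate(obs, rng):
    # Fill each unobserved cell (None) with a random guess.
    return [c if c is not None else rng.choice(["wood", "gem"]) for c in obs]

def synthesize(worlds, goal):
    # A "program" here is just the index of the cell to visit; pick the one
    # containing the goal item in the most sampled worlds (MaxSAT analogue).
    return max(range(len(worlds[0])),
               key=lambda i: sum(w[i] == goal for w in worlds))

def execute(target, obs):
    # Visiting a cell reveals its true contents.
    obs = list(obs)
    obs[target] = HIDDEN_MAP[target]
    return obs

def mpps_agent(goal="gem", num_worlds=20, max_iters=10, seed=0):
    rng = random.Random(seed)
    obs = [HIDDEN_MAP[0]] + [None] * (len(HIDDEN_MAP) - 1)  # start cell seen
    for _ in range(max_iters):
        worlds = [hallucinate(obs, rng) for _ in range(num_worlds)]
        target = synthesize(worlds, goal)
        obs = execute(target, obs)
        if HIDDEN_MAP[target] == goal:
            return target
        # Otherwise, replan: the failed visit revealed one more cell.
    return None
```

Because each failed attempt reveals a cell, repeated replanning steadily sharpens the hallucinations, mirroring the implicit exploration incentive described above.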

Similar to Sun et al. (2020), the user instantiates our framework in a new domain by providing a set of prototype components φ_1, …, φ_n, where each φ_i is a logical formula encoding a useful subtask for that domain. For instance, φ_i may encode that the agent should navigate to a goal position. The user does not need to provide a policy to achieve φ_i; our framework uses reinforcement learning to automatically train such a policy π_i. Our executor reuses these policies to solve different tasks in varying environments within the same domain. In particular, for a new task and/or environment, the user only needs to provide a specification φ*, which is a logical formula encoding the goal of that task.

We instantiate this approach in the context of a 2D Minecraft-inspired environment (Andreas et al., 2017; Sohn et al., 2018; Sun et al., 2020), which we call the “craft environment”, and a “box-world” environment (Zambaldi et al., 2019). We demonstrate that our approach significantly outperforms existing approaches for partially observable environments, while performing essentially as well as using handcrafted programs to guide the agent. In addition, we demonstrate that the policy we learn can be transferred to a continuous variant of the craft environment, where the agent is replaced by a MuJoCo (Todorov et al., 2012) ant.

Related work. There has been recent interest in program-guided reinforcement learning, where a program encoding high-level instructions on how to achieve the goal (essentially, a sequence of options) is used to guide the agent. Andreas et al. (2017) uses programs to guide agents that are initially unaware of any semantics of the programs (i.e., the program is just a sequence of symbols), with the goal of understanding whether the structure of the program alone is sufficient to improve learning. Jothimurugan et al. (2019) enables users to write specifications in a high-level language based on temporal logic. Then, they show how to translate these specifications into shaped reward functions to guide learning. Most closely related is recent work (Sun et al., 2020) that has demonstrated how program semantics can be used to guide reinforcement learning in the craft environment. As with this work, we assume that the user provides semantics of each option in the program (i.e., the subgoal that should be achieved by that option), but not an actual policy implementing this option (which is learned using reinforcement learning). However, we do not assume that the user provides the program, just the overall goal.

More broadly, our work fits into the literature on combining high-level planning with reinforcement learning. In particular, there is a long literature on planning with options (Sutton et al., 1999) (also known as skills (Hausman et al., 2018)), including work on inferring options (Stolle and Precup, 2002). However, these approaches cannot be applied to MDPs with continuous state and action spaces or to partially observed MDPs. Recent work has addressed the former (Abel et al., 2020; Jothimurugan et al., 2021) by combining high-level planning with reinforcement learning to handle low-level control, but not the latter, whereas our work tackles both challenges. Similarly, classical planning algorithms such as STRIPS (Fikes and Nilsson, 1971) cannot handle uncertainty in the realization of the environment. There has also been work on replanning (Stentz et al., 1995) to handle small changes to an initially known environment, but these methods cannot handle environments that are initially completely unknown. Alternatively, there has been work on hierarchical planning in POMDPs (Charlin et al., 2007; Toussaint et al., 2008), but these are not designed to handle continuous state and action spaces. We leverage program synthesis (Solar-Lezama, 2008) in conjunction with the world models approach (Ha and Schmidhuber, 2018) to address these issues.

Finally, there has broadly been recent interest in using program synthesis to learn programmatic policies that are more interpretable (Verma et al., 2018; Inala et al., 2021), verifiable (Bastani et al., 2018; Verma, 2019), and generalizable (Inala et al., 2020). In contrast, we are not directly synthesizing the policy, but a program to guide the policy.

2 Motivating Example

Figure 1(a) shows a 2D Minecraft-inspired crafting game. In this grid world, the agent can navigate and collect resources (e.g., wood), build tools (e.g., a bridge) at workshops using collected resources, and use the tools to achieve subtasks (e.g., use a bridge to cross water). The agent can only observe a local region around its current position; since the environment is static, it also memorizes locations it has seen before. A single task consists of a randomly generated map (i.e., the environment) and goal (i.e., obtain a certain resource or build a certain tool).

To instantiate our framework, we provide prototype components that specify high-level behaviours such as getting wood or using a toolshed to build a bridge. Figure 2 shows the domain-specific language that encodes the set of prototypes.

For each prototype, we need to provide a logical formula that formally specifies its desired behavior. Rather than specifying behavior over concrete states s, we instead specify it over abstraction variables that encode subsets of the state space. For instance, we divide the map into zones, which are regions separated by obstacles such as water and stone. As an example, the map in Figure 1(a) has two zones: the region containing the agent and the region blocked off by stones. Then, the zone the agent is currently in is represented by an abstraction variable z_agent—i.e., the set of states where the agent is in zone z is represented by the logical predicate z_agent = z.

The prototype components are logical formulas over these abstraction variables—e.g., the prototype for “get wood” is

(z_agent^init = z) ∧ (z_agent^fin = z′) ∧ conn(z, z′) ∧ (inv_wood^fin = inv_wood^init + 1) ∧ (cnt_wood,z′^fin = cnt_wood,z′^init − 1).

In this formula, conn(z, z′) indicates whether zones z and z′ are connected, cnt_r,z denotes the count of resource r in zone z, and inv_r denotes the count of resource r in the agent’s inventory. The init and fin superscripts on each abstraction variable indicate that it represents the initial state of the agent before the execution of the prototype and the final state of the agent after the execution of the prototype, respectively.

Thus, this formula says that (i) the agent goes from zone z to z′, (ii) z and z′ are connected, (iii) the count of wood in the agent’s inventory increases by one, and (iv) the count of wood in zone z′ decreases by one. All of the prototype components we use are summarized in Appendix A.

Before solving any tasks, for each prototype φ_i, our framework uses reinforcement learning to train a component c_i that implements φ_i—i.e., an option that attempts to satisfy the behavior encoded by the logical formula φ_i.

To solve a new task, the user provides a logical formula φ* encoding the goal of this task. Then, the agent acts in the environment to try to achieve φ*. For example, Figure 1(a) shows the initial state of an agent whose task is to obtain a gem.

First, based on the observations so far, the agent uses the hallucinator H to predict multiple potential worlds, each of which represents a possible realization of the full map. One convenient aspect of our approach is that rather than predicting concrete states, it suffices to predict the abstraction variables used in the prototype components and the goal specification φ*. For instance, Figure 1(a) shows two samples of the world predicted by H; here, the only values it predicts are the number of zones in the map, the type of the boundary between the zones, and the counts of the resources and workshops in each zone. In this example, the first predicted world contains two zones, and the second contains one zone. Note that in both predicted worlds, there is a gem located in the same zone as the agent.

Next, the synthesizer generates a program ρ that achieves the goal in the maximum possible number of predicted worlds. The synthesized program in Figure 1(a) is a single component “get gem”, which is an option that searches the current zone (or zones already connected with the current zone) for a gem. Note that this program achieves the goal in both of the predicted worlds shown in Figure 1(a).

Finally, the agent executes the program for a fixed number of steps. In particular, it executes the policy π_i of component c_i until β_i holds, upon which it switches to executing c_{i+1}. In our example, there is only one component “get gem”, so it executes the policy for this component until the agent finds a gem.

In this case, the agent fails to achieve φ* since there is no gem in the same zone as the agent. Thus, the agent repeats the above process. Since the agent now has more observations, H more accurately predicts the world. For instance, Figure 1(b) shows the intermediate step when the agent replans for the first time. Note that it now correctly predicts that the only gem is in the second zone. As a result, the newly synthesized program builds an axe to break the stone so that the agent can reach the zone containing the gem. Finally, the agent executes this new program, which successfully obtains the gem.

Figure 2: Prototype components for the craft environment; the three kinds of prototypes are get resource, use tool, and use workshop.

3 Problem Formulation


We consider a partially observed Markov decision process (POMDP) with states S, actions A, observations O, initial state distribution D over S, observation function Z : S → O, and transition function f : S × A → S. Given an initial state s_0 ~ D, a policy π, and a time horizon T, the generated trajectory is s_0, s_1, …, s_T, where o_t = Z(s_t), a_t = π(o_t), and s_{t+1} = f(s_t, a_t).

We assume that the state includes the unobserved parts of the environment—e.g., in our craft environment, it represents both the entire map as well as the agent’s current position.

Programs. We consider programs ρ that are composed of components c_1, …, c_k. Each component c_i represents an option (π_i, β_i), where π_i is a policy and β_i : O → {true, false} is a termination condition. To execute ρ, the agent uses the options in sequence. To use option i, it takes actions a_t = π_i(o_t) until β_i(o_t) = true; at this point, the agent switches to option i + 1 and continues this process.
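The execution semantics above can be sketched as follows; this is an illustrative stand-in in which policies and termination conditions are plain functions, and the environment step function is passed in explicitly.

```python
# Sketch of executing a program (a list of options) in sequence: run each
# policy until its termination condition fires, then move to the next
# option. All names here are illustrative; in the paper, each option comes
# from a learned component.

def run_program(program, obs, step_env, max_steps):
    """program: list of (policy, beta) pairs; step_env: (obs, action) -> obs."""
    i, t = 0, 0
    while i < len(program) and t < max_steps:
        policy, beta = program[i]
        if beta(obs):          # termination condition holds: next option
            i += 1
            continue
        obs = step_env(obs, policy(obs))
        t += 1
    return obs, i              # i == len(program) means the program finished
```

For example, with an integer-valued "observation" and options that increment it until thresholds 3 and 5 are reached, the executor runs the first policy for three steps, switches options, and finishes after two more steps.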

User-provided prototype components. Rather than have the user directly provide the components used in our programs, we instead have them provide prototype components φ_1, …, φ_n. Importantly, prototypes can be shared across closely related tasks. Each prototype component φ is a logical formula that encodes the expected desired behavior of a component. More precisely, φ is a logical formula over variables s^init and s^fin, where s^init denotes the initial state before executing the option and s^fin denotes the final state after executing the option. For instance, the prototype component

(s^init = s_1 ⇒ s^fin = s_2) ∧ (s^init = s_3 ⇒ s^fin = s_4)

says that if the POMDP is currently in state s_1, then the option should transition it to s_2, and if it is currently in state s_3, then the option should transition it to s_4.

Rather than directly define φ over the states s^init and s^fin, we can instead define it over abstraction variables that represent subsets of the state space. This approach can improve the scalability of our synthesis algorithm—e.g., it enables us to operate over continuous state spaces as long as the abstraction variables themselves are discrete.

User-provided specification. To specify a task, the user provides a specification φ*, which is a logical formula over states s; in general, φ* may not directly refer to s but to other variables that represent subsets of S. Our goal is to design an agent that achieves any given φ* (i.e., acts in the POMDP to reach a state that satisfies φ*) as quickly as possible.

4 Model Predictive Program Synthesis

Figure 3: Architecture of our agent (the blue box).

We describe the architecture of our agent, depicted in Figure 3. It is composed of three parts: the hallucinator H, which predicts possible worlds; the synthesizer, which generates a program ρ that succeeds with high probability according to worlds sampled from H; and the executor, which uses ρ to act in the POMDP. These parts are run once every N steps to generate a program to execute for the subsequent N steps, until the user-provided specification φ* is achieved.

Hallucinator. First, the hallucinator H is a conditional generative model trained to predict the environment given the observations so far. For simplicity, we assume the observation o_t on the current step already encodes all observations so far. Since our craft environment is static, o_t simply encodes the portion of the map that has been revealed so far, with a special symbol indicating parts that are unknown. To be precise, the hallucinator encodes a distribution H(s | o), which is trained to approximate the actual distribution P(s | o). Then, at each iteration (i.e., once every N steps), our agent samples m worlds s̃_1, …, s̃_m ~ H(· | o_t). We choose H to be a conditional variational auto-encoder (CVAE) (Sohn et al., 2015).

When using abstraction variables to represent the states, we can have H directly predict the values of these abstraction variables instead of the concrete state. Intuitively, this approach works because, as described below, the synthesizer only needs to know the values of the abstraction variables to generate a program.
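As a toy illustration of conditioning on partial observations, the following stand-in for the hallucinator rejection-samples complete maps consistent with the observed cells. A real hallucinator is a learned CVAE, not an empirical distribution over training maps; the map representation and names below are our own.

```python
import random

# Toy stand-in for hallucinator sampling: condition on the observed cells
# (None marks unobserved cells) and draw completions of the rest. Here the
# "generative model" is just an empirical distribution over training maps.

TRAINING_MAPS = [
    ["wood", "iron", "gem"],
    ["wood", "wood", "gem"],
    ["iron", "wood", "gem"],
]

def sample_world(obs, rng):
    # Rejection-sample a training map consistent with the observation --
    # a crude approximation of sampling from H(s | o).
    while True:
        world = rng.choice(TRAINING_MAPS)
        if all(o is None or o == w for o, w in zip(obs, world)):
            return world
```

As the agent observes more cells, fewer maps are consistent with the observation, so the sampled worlds concentrate on the truth, which is the behavior the synthesizer relies on.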

Synthesizer. The synthesizer aims to compute the program ρ that maximizes the probability of satisfying the goal φ*:

ρ̂ = argmax_ρ (1/m) Σ_{j=1}^m 1[ρ achieves φ* in world s̃_j],   (1)

where the s̃_j are samples from H. The objective (1) can be expressed as a MaxSAT problem (Krentel, 1986). In particular, suppose for now that we are searching over programs of fixed length k. Then, consider the constrained optimization problem

max Σ_{j=1}^m b_j   subject to   b_j ⇒ [ (s_{0,j} = s̃_j) ∧ ⋀_{i=1}^k φ_{ρ_i}(s_{i−1,j}, s_{i,j}) ∧ φ*(s_{k,j}) ]   for all j,   (2)

where ρ = (ρ_1, …, ρ_k) and b_j ∈ {0, 1} (for j ∈ {1, …, m}), together with the intermediate states s_{i,j}, are the optimization variables. Intuitively, ρ encodes the program, and b_j encodes the event that ρ solves φ* for world s̃_j. In particular,

the constraint s_{0,j} = s̃_j encodes that the initial state is the sampled world s̃_j,

the constraint φ_{ρ_i}(s_{i−1,j}, s_{i,j}) encodes that if the i-th component has prototype φ_{ρ_i}, then the i-th component should transition the system from s_{i−1,j} to s_{i,j},

sharing the variable s_{i,j} across consecutive constraints encodes that the final state of component i should equal the initial state of component i + 1, and

the constraint φ*(s_{k,j}) encodes that the final state of the last component should satisfy the user-provided goal φ*.

We use a MaxSAT solver (De Moura and Bjørner, 2008) to solve (2). Given a solution (ρ̂, b̂, ŝ), the synthesizer returns the corresponding program ρ̂.

We incrementally search for longer and longer programs, starting from k = 1 and incrementing k until either we find a program that achieves at least a minimum objective value, or we reach a maximum program length k_max, at which point we use the best program found so far.
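The incremental search can be sketched with a brute-force enumerator in place of the MaxSAT solver: it scores each candidate program by the number of sampled worlds in which it reaches the goal, and returns early once a program meets the threshold. The abstract-world interface (`apply_component`, `goal_holds`) is our own illustration, not the paper's encoding.

```python
from itertools import product

# Brute-force stand-in for the MaxSAT synthesizer: enumerate programs of
# increasing length and keep the one that reaches the goal in the most
# sampled worlds. apply_component(world, comp) returns the successor
# abstract world, or None if the component is inapplicable.

def synthesize(components, worlds, goal_holds, apply_component,
               min_frac=1.0, max_len=3):
    best, best_score = [], -1
    for k in range(1, max_len + 1):
        for prog in product(components, repeat=k):
            score = 0
            for w in worlds:
                for comp in prog:
                    w = apply_component(w, comp)
                    if w is None:
                        break
                if w is not None and goal_holds(w):
                    score += 1
            if score > best_score:
                best, best_score = list(prog), score
            if best_score >= min_frac * len(worlds):
                return best, best_score  # good enough: stop the search early
    return best, best_score
```

For example, with abstract worlds (has_wood, has_axe, has_gem, gem_reachable) and components that gather wood, build an axe (unblocking the gem), and fetch the gem, the search skips the too-short "get gem" program once an unreachable-gem world is sampled and returns the three-step program instead.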

Executor. The executor runs the synthesized program ρ̂ for the subsequent N steps. It iteratively uses each component c_i, starting from i = 1. In particular, it uses action a_t = π_i(o_t) at each time step t, where o_t is the observation on that step. It does so until β_i(o_t) = true, at which point it increments i.

Finally, it continues until either it has completed running the program (i.e., i > k), or N time steps have elapsed. In the former case, by construction, the goal has been achieved, so the agent terminates. In the latter case, the agent iteratively reruns the above three steps based on the current observation to synthesize a new program. At this point, the hallucinator likely has additional information about the environment, so the new program has a greater chance of achieving φ*.

5 Learning Algorithm

Next, we describe our algorithm for learning the parameters of the models used by our agent. In particular, there are two parts that need to be learned: (i) the parameters of the conditional variational auto-encoder (CVAE) hallucinator H, and (ii) the components c_i based on the user-provided prototype components φ_i.

Hallucinator. We choose the hallucinator H to be a conditional variational auto-encoder (CVAE) (Sohn et al., 2015) trained to estimate the distribution P(s | o) of states given the current observation. First, we obtain samples (s, o) from rollouts collected using a random agent. Then, we train the CVAE using the standard evidence lower bound (ELBo) on the log likelihood (Kingma and Welling, 2013):

log p_θ(s | o) ≥ E_{q_η(z | s, o)}[log p_θ(s | z, o)] − D_KL(q_η(z | s, o) ∥ p(z)),   (3)

where q_η is the encoder and p_θ is the decoder:

q_η(z | s, o) = N(z; μ_η(s, o), diag(σ_η(s, o))²),   p_θ(s | z, o) = N(s; μ_θ(z, o), σ_θ² I),

where μ_η, σ_η, μ_θ, and σ_θ are neural networks, p(z) = N(0, I) is the prior, and I is the identity matrix. We train η and θ by jointly optimizing (3), and then choose the hallucinator to be H(s | o) = E_{p(z)}[p_θ(s | z, o)].
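The ELBo objective (3) for a diagonal-Gaussian CVAE can be sketched as follows, with the encoder and decoder passed in as plain functions. This is a minimal single-sample estimate assuming a standard normal prior; a real implementation would use neural networks and automatic differentiation, and all names here are our own.

```python
import math
import random

# Sketch of a one-sample ELBo estimate for a diagonal-Gaussian CVAE.
# encode(s, o) returns (mu, log_var) of the approximate posterior;
# decode(z, o) returns the mean of the Gaussian likelihood.

def kl_std_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def gaussian_log_prob(x, mean, sigma):
    # log N(x; mean, sigma^2 I) for a fixed scalar sigma.
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (xi - mi) ** 2 / (2 * sigma ** 2)
               for xi, mi in zip(x, mean))

def elbo(s, o, encode, decode, rng, sigma=1.0):
    mu, log_var = encode(s, o)
    # Reparameterized sample z ~ q(z | s, o).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0, 1)
         for m, lv in zip(mu, log_var)]
    # One-sample estimate of E_q[log p(s | z, o)] minus the KL term.
    return gaussian_log_prob(s, decode(z, o), sigma) - kl_std_normal(mu, log_var)
```

When the encoder already matches the prior (zero mean, unit variance), the KL term vanishes and the ELBo reduces to the reconstruction log-likelihood, which is the sanity check one would run first.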

Executor. Our framework uses reinforcement learning to learn components c_i that implement the user-provided prototype components φ_i. The learned components can be shared across multiple tasks. Our approach is based on neural module networks for reinforcement learning (Andreas et al., 2017). In particular, we train a neural module π_i for each component c_i. In addition, we construct a monitor β_i that checks when to terminate execution, and take c_i = (π_i, β_i).

First, β_i is constructed from φ_i—in particular, it returns whether φ_i is satisfied based on the current observation o. Note that we have assumed that φ_i can be checked based only on o; this assumption holds for all prototypes in our craft environment. If it does not hold, we additionally train π_i to explore in a way that enables it to check φ_i.

Now, to train the policies π_i, we generate random initial states s_0 and goal specifications φ*. For training, we use programs synthesized from the fully observed environments; such a program is guaranteed to achieve φ* from s_0. We use this approach since it avoids the need to run the synthesizer repeatedly during training.

Then, we sample a rollout by using the executor in conjunction with the program ρ and the current options c_i (where each π_i is randomly initialized). We give the agent a reward on each time step where it achieves the subgoal of a single component c_i—i.e., where the executor increments i. Then, we use actor-critic reinforcement learning (Konda and Tsitsiklis, 2000) to update the parameters of each policy π_i.
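This subgoal reward scheme can be sketched as follows; here observations are plain values and the termination conditions are predicates, which is an illustrative stand-in for the learned monitors.

```python
# Sketch of the subgoal reward used to train the component policies: the
# agent earns a reward exactly on time steps where the executor advances
# to the next component, i.e., where the current option's termination
# condition first fires. Names are illustrative.

def subgoal_rewards(program_betas, observations):
    """program_betas: termination conditions; observations: o_1, ..., o_T."""
    i, rewards = 0, []
    for obs in observations:
        if i < len(program_betas) and program_betas[i](obs):
            rewards.append(1.0)   # subgoal of component i achieved
            i += 1
        else:
            rewards.append(0.0)
    return rewards
```

Because the reward fires once per completed subgoal, each policy π_i receives credit exactly for finishing its own component, which is what makes the modules reusable across programs.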

Finally, as in Andreas et al. (2017), we use curriculum learning to speed up training—i.e., we train using goals that can be achieved with shorter programs first.

6 Experiments

In this section, we describe empirical evaluations of our approach. As we show, it significantly outperforms non-program-guided baselines, while performing essentially as well as an oracle that is given the ground truth program.

Figure 4: (a,b) Training curves for the 2D-craft environment. (c,d) Training curves for the box-world environment. (a,c) The average reward on the test set over the course of training; the agent gets a reward of 1 if it successfully finishes the task within the time horizon, and 0 otherwise. (b,d) The average number of steps taken to complete the task on the test set. We show our approach (“Ours”), the program guided agent (“Oracle”), the end-to-end neural policy (“End-to-end”), world models (“World models”), and relational deep RL (“Relational”).
Figure 5: (a) The Ant-craft environment. The policy needs to control the ant to perform the crafting tasks. (b) The box-world environment. The grey pixel denotes the agent. The goal is to get the white key. The unobserved parts of the map are marked with “x”. The key currently held by the agent is shown in the top-left corner. In this map, the number of boxes on the path to the goal is 4, and the map contains 1 distractor branch.

6.1 Benchmarks

2D-craft. We consider a 2D Minecraft-inspired crafting game based on the ones in Andreas et al. (2017) and Sun et al. (2020) (Figure 1(a)). A map in this domain is a 2D grid, where each grid cell either is empty or contains a resource (e.g., wood or gold), an obstacle (e.g., water or stone), or a workshop. In each episode, we randomly sample a map from a predefined distribution, a random initial position for the agent, and a random task (one of 14 possibilities, each of which involves getting a certain resource or building a certain tool). The more complicated tasks may require the agent to build intermediate tools (e.g., a bridge or an axe) to reach initially inaccessible regions. In contrast to prior work, our agent does not initially observe the entire map; instead, it can only observe the grid cells in a small square around it. Since the environment is static, any previously visited cells remain visible. The agent has a discrete action space, including move actions in four directions and a special “use” action that can pick up a resource, use a workshop, or use a tool. The maximum length of each episode is a fixed horizon T.

Ant-craft. Next, we consider a variant of 2D-craft where the agent is replaced with a MuJoCo (Todorov et al., 2012) ant (Schulman et al., 2016) (illustrated in Figure 5(a)). For simplicity, we do not model the physics of the interaction between the ant and its environment—e.g., the ant automatically picks up resources in the grid cell it currently occupies. The policy needs to learn both the continuous control required to walk the ant and the high-level strategy for performing the tasks. This environment is designed to demonstrate that our approach can be applied to continuous control tasks.

Box-world. Finally, we consider the box-world environment (Zambaldi et al., 2019), which requires abstract relational reasoning. It is a grid world with locks and boxes randomly scattered throughout (visualized in Figure 5(b)). Each lock occupies a single grid cell, and the box it locks occupies the adjacent grid cell. The box contains a key that can open a subsequent lock. Each lock and box is colored; the key needed to open a lock is contained in the box of the same color. The agent is given a key to get started, and its goal is to unlock the box of a given color. The agent can move in the room in four directions; it opens a lock for which it has the key simply by walking over it, at which point it can pick up the adjacent key. We assume that once the agent has the key of a given color, it can unlock multiple locks of that color. We modify the original environment to be partially observable; in particular, the agent can observe a small square around it (as well as the previously observed grid cells). In each episode, we sample a random configuration of the map, where the number of boxes on the path to the goal is randomly chosen between 1 and 4, and the number of “distractor branches” (i.e., boxes that the agent can open but that do not help it reach the goal) is also randomly chosen between 1 and 4.

6.2 Baselines

End-to-end. An end-to-end neural policy trained with the same actor-critic algorithm and curriculum learning as discussed in Section 5. It uses one actor network per task.

World models. The world models approach (Ha and Schmidhuber, 2018) handles partial observability by using a generative model to predict the future. It trains a V model (a VAE) and an M model (an MDN-RNN) to learn a compressed spatial and temporal representation of the environment. The V model takes the observation o_t at each step and encodes it into a latent vector z_t. The M model is a recurrent model that takes the latent vectors z_t as input and predicts z_{t+1}. The latent states of the M model and the latent vectors from the V model together form the world-model features, which are used as inputs to the controller (C model).

Program guided agent. This technique uses a program to guide the agent policy (Sun et al., 2020). Unlike our approach, the ground truth program (i.e., a program guaranteed to achieve the goal) is provided to the agent at the beginning; we synthesize this program using the full map (i.e., including parts of the map that are unobserved by the agent). This baseline can be viewed as an oracle, since it is strictly more powerful than our approach.

Relational Deep RL. For the box-world environment, we also compare with the relational deep RL approach (Zambaldi et al., 2019), which replaces the policy network with a relational module based on the multi-head attention mechanism (Vaswani et al., 2017) operating over the map features. The output of the relational module is used as input to an MLP network that computes the action.

Figure 6: Comparison of behaviors between the optimistic approach (left) and our MPPS approach (right), in a scenario where the goal is to get the gem. (a) This state is the point at which the optimistic approach first synthesizes the correct program instead of the (incorrect) program “get gem”. It only does so after the agent has observed all the squares in its current zone (the green arrows show the agent’s trajectory so far). (b) The initial state of our MPPS strategy. It directly synthesizes the correct program, since the hallucinator knows the gem is most likely in the other zone. Thus, the agent completes the task much more quickly.

6.3 Implementation Details

2D-craft environment. For our approach, we use a CVAE as the hallucinator, with MLPs (a hidden layer of dimension 200) for the encoder and the decoder. We pre-train the CVAE on 100 rollouts with 100 timesteps in each rollout—i.e., 10,000 (s, o) pairs. We use the Z3 solver (De Moura and Bjørner, 2008) to solve the MaxSAT synthesis problem. We fix the number of sample completions m and the number of steps N between replanning as hyperparameters. We use the same architecture for the actor networks and critic networks across our approach and all baselines: for actor networks, we use an MLP with a hidden layer of dimension 128, and for critic networks, an MLP with a hidden layer of dimension 32. We train each model on 400K episodes, and evaluate on a test set containing 10 scenarios per task.

Ant-craft. We first pre-train a goal-following policy for the ant: given a randomly chosen goal position, this policy controls the ant to move to that position. We use the soft actor-critic algorithm (Haarnoja et al., 2018) for pre-training. The executor in our approach, as well as the baseline policies, outputs actions that are translated into goal positions given as inputs to this ant controller. We let the ant controller run for 50 timesteps in the simulator to execute each move action from the upstream policies. We initialize each policy with the trained model from the 2D-craft environment, and fine-tune it on the Ant-craft environment for 40K episodes.

Box-world. Following Zambaldi et al. (2019), we use a one-layer CNN with 32 kernels of size to process the raw map inputs before feeding them into the downstream networks across all approaches. For the programs in our approach, we have a prototype component for each color, where the desired behavior of the component is to get the key of that color. The full definition of the prototype components we use for box-world is in Appendix B. For the hallucinator CVAE, we use the same architecture as in the craft environment with a hidden dimension of 300, and train it on 100K pairs. For the synthesizer, we set and . We train each model for 200K episodes, and evaluate on a test set containing 10 scenarios per level. Each level has a specific number of boxes on the path to the goal (i.e., the goal length). Our test set contains four levels with goal lengths from 1 to 4.

                Avg. reward   Avg. finish step
End-to-end      0.49          60.7
World models    0.50          59.3
Ours            0.93          26.7
Oracle          0.93          25.9

Table 1: Average rewards and average completion times (i.e., number of steps) on the test set for the Ant-craft environment, for the best policy found by each approach.

6.4 Results

2D-craft. Figures 3(a) & 3(b) show the training curves for each approach. As can be seen, our approach learns a substantially better policy than the unsupervised baselines; it solves a larger percentage of test scenarios and does so in less time. Compared with the program-guided agent (i.e., the oracle), our approach achieves a similar average reward with a slightly longer average finish time. These results demonstrate that our approach significantly outperforms non-program-guided baselines, while performing nearly as well as an oracle that knows the ground-truth program.

Ant-craft. Table 1 shows results for the best policy found using each approach. As before, our approach significantly outperforms the baseline approaches while performing comparably with the oracle approach.

Box-world. Figures 3(c) & 3(d) show the training curves. As before, our approach performs substantially better than the baselines, and performs similarly to the program-guided agent (i.e., the oracle).

             Avg. reward   Avg. finish step
Optimistic   0.60          53.7
Ours         0.79          41.8
Oracle       0.79          37.7

Table 2: Comparison to the optimistic ablation on the most challenging tasks in the 2D-craft environment.

6.5 Optimistic Ablation

Finally, we compare our model predictive program synthesis with an alternative, optimistic synthesis strategy: it treats the unobserved parts of the map as possibly being in any configuration, and synthesizes the shortest program that works for at least one of these possibilities. We compare on the most challenging tasks in the 2D-craft environment (i.e., get gold or get gem), since for these tasks the ground-truth program depends heavily on the map. We show results in Table 2. As can be seen, our approach significantly outperforms the optimistic synthesis approach, and performs comparably to the oracle. Finally, in Figure 6, we illustrate the difference in behavior between our approach and the optimistic strategy.

7 Conclusion

We have proposed an algorithm for synthesizing programs to guide reinforcement learning. Our algorithm, called model predictive program synthesis, handles partially observed environments by leveraging the world models approach: it learns a generative model over the remainder of the world conditioned on the observations thus far. In particular, it synthesizes a guiding program that accounts for the uncertainty in the world model. Our experiments demonstrate that our approach significantly outperforms non-program-guided approaches, while performing comparably to an oracle that is given access to the ground-truth program. These results demonstrate that our approach can obtain the benefits of program-guided reinforcement learning without requiring the user to provide a guiding program for every new task and world configuration.


  • D. Abel, N. Umbanhowar, K. Khetarpal, D. Arumugam, D. Precup, and M. Littman (2020) Value preserving state-action abstractions. In International Conference on Artificial Intelligence and Statistics, pp. 1639–1650. Cited by: §1.
  • J. Andreas, D. Klein, and S. Levine (2017) Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 166–175. External Links: Link Cited by: §1, §1, §1, §5, §5, §6.1.
  • K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath (2017) Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine 34 (6), pp. 26–38. Cited by: §1.
  • O. Bastani, Y. Pu, and A. Solar-Lezama (2018) Verifiable reinforcement learning via policy extraction. arXiv preprint arXiv:1805.08328. Cited by: §1.
  • L. Charlin, P. Poupart, and R. Shioda (2007) Automated hierarchy discovery for planning in partially observable environments. Advances in Neural Information Processing Systems 19, pp. 225. Cited by: §1.
  • L. De Moura and N. Bjørner (2008) Z3: an efficient SMT solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS’08/ETAPS’08, Berlin, Heidelberg, pp. 337–340. External Links: ISBN 3540787992 Cited by: §4, §6.3.
  • R. E. Fikes and N. J. Nilsson (1971) STRIPS: a new approach to the application of theorem proving to problem solving. Artificial intelligence 2 (3-4), pp. 189–208. Cited by: §1, §1.
  • D. Ha and J. Schmidhuber (2018) World models. CoRR abs/1803.10122. External Links: Link, 1803.10122 Cited by: 1st item, §1, §6.2.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1861–1870. External Links: Link Cited by: §6.3.
  • K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller (2018) Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, Cited by: §1.
  • J. P. Inala, O. Bastani, Z. Tavares, and A. Solar-Lezama (2020) Synthesizing programmatic policies that inductively generalize. In International Conference on Learning Representations, Cited by: §1.
  • J. P. Inala, Y. Yang, J. Paulos, Y. Pu, O. Bastani, V. Kumar, M. Rinard, and A. Solar-Lezama (2021) Neurosymbolic transformers for multi-agent communication. arXiv preprint arXiv:2101.03238. Cited by: §1.
  • K. Jothimurugan, R. Alur, and O. Bastani (2019) A composable specification language for reinforcement learning tasks. In NeurIPS, Cited by: §1.
  • K. Jothimurugan, O. Bastani, and R. Alur (2021) Abstract value iteration for hierarchical reinforcement learning. In AISTATS, Cited by: §1.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §5.
  • V. Konda and J. Tsitsiklis (2000) Actor-critic algorithms. In Advances in Neural Information Processing Systems, S. Solla, T. Leen, and K. Müller (Eds.), Vol. 12, pp. 1008–1014. External Links: Link Cited by: §5.
  • M. W. Krentel (1986) The complexity of optimization problems. In Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, STOC ’86, New York, NY, USA, pp. 69–76. External Links: ISBN 0897911938, Link, Document Cited by: §4.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2016) High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §6.1.
  • K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. 3483–3491. External Links: Link Cited by: §4, §5.
  • S. Sohn, J. Oh, and H. Lee (2018) Hierarchical reinforcement learning for zero-shot generalization with subtask dependencies. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, pp. 7156–7166. Cited by: §1.
  • A. Solar-Lezama (2008) Program synthesis by sketching. Citeseer. Cited by: §1.
  • A. Stentz et al. (1995) The focussed D* algorithm for real-time replanning. In IJCAI, Vol. 95, pp. 1652–1659. Cited by: §1.
  • M. Stolle and D. Precup (2002) Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pp. 212–223. Cited by: §1.
  • S. Sun, T. Wu, and J. J. Lim (2020) Program guided agent. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §1, §1, §6.1, §6.2.
  • R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: 3rd item, §1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. , pp. 5026–5033. External Links: Document Cited by: §1, §6.1.
  • M. Toussaint, L. Charlin, and P. Poupart (2008) Hierarchical pomdp controller optimization by likelihood maximization.. In UAI, Vol. 24, pp. 562–570. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. External Links: 1706.03762 Cited by: §6.2.
  • A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri (2018) Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pp. 5045–5054. Cited by: §1.
  • A. Verma (2019) Verifiable and interpretable reinforcement learning through program synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9902–9903. Cited by: §1.
  • V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, and P. Battaglia (2019) Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, External Links: Link Cited by: §1, §6.1, §6.2, §6.3.

Appendix A Prototype Components for Craft

In this section, we describe the prototype components (i.e., logical formulas encoding option pre/postconditions) that we use for the craft environment. First, recall that the domain-specific language that encodes the set of prototypes for the craft environment is

Also, the set of possible artifacts (objects that can be made in some workshop using resources or other artifacts) in the craft environment is

We define the following abstraction variables:


  • Zone: indicates the agent is in zone

  • Boundary: indicates how zones and are connected, where

  • Resource: indicates that there are units of resource in zone

  • Workshop: , where , indicates whether there exists a workshop in zone

  • Inventory: indicates that there are objects (either a resource or an artifact) in the agent’s inventory

We use and to denote the initial state and final state of a prototype component, respectively. Now, the logical formula for each prototype component is defined as follows.

(1) “get ” (for any resource ). First, we have the following prototype component telling the agent to obtain a specific resource :

Here, refers to the conditions that the other fields of the abstract state stay the same—i.e.,

where means all the other fields in except , and similarly for . In particular, this condition addresses the frame problem from classical planning.
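The frame condition can be read as a simple check: every abstraction variable not mentioned by the prototype component must keep its value between the initial and final abstract states. A minimal sketch, with illustrative field names:

```python
# Minimal sketch of the frame condition from classical planning:
# all fields outside `changed_fields` must be unchanged between the
# initial state s0 and the final state s1 (field names are illustrative).

def frame_ok(s0, s1, changed_fields):
    """True iff every field not in `changed_fields` stays the same."""
    return all(s0[f] == s1[f] for f in s0 if f not in changed_fields)

s0 = {"inventory[wood]": 0, "inventory[iron]": 2, "zone": 1}
s1 = {"inventory[wood]": 3, "inventory[iron]": 2, "zone": 1}
# "get wood" may only change the wood count:
assert frame_ok(s0, s1, {"inventory[wood]"})
assert not frame_ok(s0, s1, set())  # wood changed, so the empty frame fails
```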

(2) “use ” (for any workshop ). Next, we have a prototype component telling the agent to use a workshop to create an artifact. To do so, we introduce a set of auxiliary variables to denote the number of artifacts made in this component: indicates that units of artifact are made; we denote the set of artifacts that can be made at workshop as , and the number of units of ingredient needed to make one unit of artifact as , where ; note that and come from the rules of the game.

Then, the logical formula for “use ” is


This formula reflects the game mechanic that when the agent uses a workshop, it keeps making artifacts until the ingredients in its inventory are depleted.
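As a rough sketch of this semantics (the recipe and item names below are hypothetical), crafting repeats until some ingredient runs out:

```python
# Illustrative sketch of the "use workshop" semantics: craft an artifact
# repeatedly until the inventory no longer covers the recipe.

def use_workshop(inventory, recipe, artifact):
    """Craft `artifact` repeatedly; each unit consumes `recipe` items.
    Returns the number of units made, mutating `inventory` in place."""
    made = 0
    while all(inventory.get(item, 0) >= n for item, n in recipe.items()):
        for item, n in recipe.items():
            inventory[item] -= n
        inventory[artifact] = inventory.get(artifact, 0) + 1
        made += 1
    return made

inv = {"wood": 5}
assert use_workshop(inv, {"wood": 2}, "plank") == 2
assert inv == {"wood": 1, "plank": 2}  # one leftover wood, two planks
```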

(3) “use r” ( bridge/axe). Next, we have the following prototype component telling the agent to use a tool. The formula for this prototype component encodes the logic of zone connectivity. In particular, it is


Appendix B Prototype Components for Box World

In this section, we describe the prototype components for the box world. They are all of the form “get ”, where is a color in the set of possible colors in the box world. First, we define the following abstraction variables:


  • Box: indicates that there are boxes with key color and lock color in the map

  • Loose key: , where , indicates whether there exists a loose key of color in the map

  • Agent’s key: , where , indicates whether the agent holds a key of color

As in the craft environment, we use and to denote the initial state and final state of a prototype component, respectively. Since a map configuration in the box world contains at most one loose key, we add the cardinality constraint , where counts the number of variables that are true.
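This cardinality constraint can be read as an at-most-one predicate over the boolean loose-key variables (one per color). A minimal sketch:

```python
# At-most-one cardinality constraint over boolean loose-key indicators,
# one flag per color (a plain-Python reading of the SMT constraint).

def at_most_one(flags):
    """True iff at most one of the boolean flags is set."""
    return sum(flags) <= 1

assert at_most_one([False, False, False])     # no loose key: allowed
assert at_most_one([False, True, False])      # one loose key: allowed
assert not at_most_one([True, True, False])   # two loose keys: forbidden
```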

Then, the logical formula defining the prototype component “get ” is


In particular, encodes the desired behavior when the agent picks up a loose key , and encodes the desired behavior when the agent unlocks a box to get key .