Multitask Reinforcement Learning for Zero-shot Generalization with Subtask Dependencies

by Sungryull Sohn, et al.
University of Michigan

We introduce a new RL problem where the agent is required to execute a given subtask graph which describes a set of subtasks and their dependencies. Unlike existing multitask RL approaches that explicitly describe what the agent should do, a subtask graph in our problem only describes properties of subtasks and relationships among them, which requires the agent to perform complex reasoning to find the optimal sequence of subtasks to execute. To tackle this problem, we propose a neural subtask graph solver (NSGS) which encodes the subtask graph using a recursive neural network. To overcome the difficulty of training, we propose a novel non-parametric gradient-based policy to pre-train our NSGS agent. Experimental results on two 2D visual domains show that our agent can perform complex reasoning to find a near-optimal way of executing the subtask graph and generalizes well to unseen subtask graphs. In addition, we compare our agent with a Monte-Carlo tree search (MCTS) method, showing that (1) our method is much more efficient than MCTS and (2) combining MCTS with NSGS dramatically improves the search performance.



1 Introduction

Developing the ability to execute many different tasks depending on given task descriptions and generalize over unseen task descriptions is an important problem for building scalable reinforcement learning (RL) agents. Recently, there have been a few attempts to define and solve different forms of task descriptions such as natural language (Oh et al., 2017; Yu et al., 2017) or formal language (Denil et al., 2017; Andreas et al., 2017). However, most of the prior works have focused on task descriptions which explicitly specify what the agent should do at a high level, which may not be readily available in real-world applications.

To further motivate the problem, consider a scenario in which an agent needs to generalize to a complex novel task by performing a composition of subtasks, where the task description and dependencies among subtasks may change depending on the situation. For example, a human user could ask a physical household robot to make a meal in an hour. A meal may be served with different combinations of dishes, each of which incurs a different cost (e.g., time) and gives a different amount of reward (e.g., user satisfaction) depending on the user preferences. In addition, there can be complex dependencies between subtasks. For example, bread should be sliced before it is toasted, and an omelette and an egg sandwich cannot both be made if there is only one egg left. Due to such complex dependencies as well as different rewards and costs, it is often cumbersome for human users to manually provide the optimal sequence of subtasks (e.g., “fry an egg and toast a bread”). Instead, the agent should learn to act in the environment by figuring out the optimal sequence of subtasks that gives the maximum reward within a time budget just from properties and dependencies of subtasks.

Figure 1: Example task and our agent’s trajectory. The agent is required to execute subtasks in the optimal order to maximize the reward within a time limit. The subtask graph describes subtasks with the corresponding rewards (e.g., subtask L gives 1.0 reward) and dependencies between subtasks through AND and OR nodes. For instance, the agent should first get the firewood (D) OR coal (G) to light a furnace (J). In this example, our agent learned to execute subtask F and its preconditions (shown in red) as soon as possible, since it is a precondition of many subtasks even though it gives a negative reward. After that, the agent mines minerals that require stone pickaxe and craft items (shown in blue) to achieve a high reward.

The goal of this paper is to formulate and solve such a problem, which we call subtask graph execution, where the agent should execute the given subtask graph in an optimal way as illustrated in Figure 1. A subtask graph consists of subtasks, corresponding rewards, and dependencies among subtasks expressed as logical expressions, a form that subsumes many existing task descriptions (e.g., sequential instructions (Oh et al., 2017)). This allows us to define many complex tasks in a principled way and train the agent to find the optimal way of executing such tasks. Moreover, we aim to solve the problem without explicit search or simulations so that our method can be more easily applied to practical real-world scenarios, where real-time performance (i.e., fast decision-making) is required and building the simulation model is extremely challenging.

To solve the problem, we propose a new deep RL architecture, called neural subtask graph solver (NSGS), which encodes a subtask graph using a recursive-reverse-recursive neural network (R3NN) (Parisotto et al., 2016) to consider the long-term effect of each subtask. Still, finding the optimal sequence of subtasks by reflecting the long-term dependencies between subtasks and the context of observation is computationally intractable. Therefore, we found that it is extremely challenging to learn a good policy when it’s trained from scratch. To address the difficulty of learning, we propose to pre-train the NSGS to approximate our novel non-parametric policy called graph reward propagation policy. The key idea of the graph reward propagation policy is to construct a differentiable representation of the subtask graph such that taking a gradient over the reward results in propagating reward information between related subtasks, which is used to find a reasonably good subtask to execute. After the pre-training, our NSGS architecture is finetuned using the actor-critic method.

The experimental results on 2D visual domains with diverse subtask graphs show that our agent implicitly performs complex reasoning by taking into account long-term subtask dependencies as well as the cost of executing each subtask from the observation, and that it can successfully generalize to unseen and larger subtask graphs. Finally, we show that our method is computationally much more efficient than the Monte-Carlo tree search (MCTS) algorithm, and that the performance of our NSGS agent can be further improved by combining it with MCTS, achieving near-optimal performance.

Our contributions can be summarized as follows: (1) We propose a new challenging RL problem and domain with a richer and more general form of graph-based task descriptions compared to the recent works on multitask RL. (2) We propose a deep RL architecture that can execute arbitrary unseen subtask graphs and observations. (3) We demonstrate that our method outperforms the state-of-the-art search-based method (e.g., MCTS), which implies that our method can efficiently approximate the solution of an intractable search problem without performing any search. (4) We further show that our method can also be used to augment MCTS, which significantly improves the performance of MCTS with far fewer simulations.

2 Related Work

Programmable Agent

The idea of learning to execute a given program using RL was introduced by programmable hierarchies of abstract machines (PHAMs) (Parr and Russell, 1997; Andre and Russell, 2000, 2002). PHAMs specify a partial policy using a set of hierarchical finite state machines, and the agent learns to execute the partial program. A different way of specifying a partial policy was explored in the deep RL framework (Andreas et al., 2017). Other approaches used a program as a form of task description rather than a partial policy in the context of multitask RL (Oh et al., 2017; Denil et al., 2017). Our work also aims to build a programmable agent in that we train the agent to execute a given task. However, most of the prior work assumes that the program specifies what to do, and the agent just needs to learn how to do it. In contrast, our work explores a new form of program, called subtask graph (see Figure 1), which describes properties of subtasks and dependencies between them, and the agent is required to figure out what to do as well as how to do it.

Hierarchical Reinforcement Learning

Many hierarchical RL approaches have been proposed to solve complex decision problems via multiple levels of temporal abstractions (Sutton et al., 1999; Dietterich, 2000; Precup, 2000; Ghavamzadeh and Mahadevan, 2003; Konidaris and Barto, 2007). Our work builds upon the prior work in that a high-level controller focuses on finding the optimal subtask, while a low-level controller focuses on executing the given subtask. In this work, we focus on how to train the high-level controller for generalizing to novel complex dependencies between subtasks.

Classical Search-Based Planning

One of the most closely related problems is the planning problem considered in hierarchical task network (HTN) approaches (Sacerdoti, 1975; Erol, 1996; Erol et al., 1994; Nau et al., 1999; Castillo et al., 2005) in that HTNs also aim to find the optimal way to execute tasks given subtask dependencies. However, they aim to execute a single goal task, while the goal of our problem is to maximize the cumulative reward in RL context. Thus, the agent in our problem not only needs to consider dependencies among subtasks but also needs to infer the cost from the observation and deal with stochasticity of the environment. These additional challenges make it difficult to apply such classical planning methods to solve our problem.

Motion Planning

Another problem related to our subtask graph execution problem is the motion planning (MP) problem (Asano et al., 1985; Canny, 1985, 1987; Faverjon and Tournassoud, 1987; Keil and Sack, 1985). The MP problem is often mapped to a graph and reduced to a graph search problem. However, unlike our problem, MP approaches, similar to HTN approaches, aim to find an optimal path to the goal in the graph while avoiding obstacles.

3 Problem Definition

3.1 Preliminary: Multitask Reinforcement Learning and Zero-Shot Generalization

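We consider an agent presented with a task drawn from some distribution as in Andreas et al. (2017); Da Silva et al. (2012). We model each task as a Markov Decision Process (MDP). Let $G \in \mathcal{G}$ be a task parameter available to the agent, drawn from a distribution $P(G)$, where $G$ defines the task and $\mathcal{G}$ is the set of all possible task parameters. The goal is to maximize the expected reward over the whole distribution of MDPs: $\int P(G)\, J(\pi, G)\, dG$, where $J(\pi, G) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$ is the expected return of the policy $\pi$ given a task defined by $G$, $\gamma$ is a discount factor, $\pi: \mathcal{S} \times \mathcal{G} \rightarrow \mathcal{A}$ is a multitask policy that we aim to learn, and $r_t$ is the reward at time step $t$. We consider zero-shot generalization, where only a subset of tasks $\mathcal{G}_{train} \subset \mathcal{G}$ is available to the agent during training, and the agent is required to generalize over a set of unseen tasks $\mathcal{G}_{test} \subset \mathcal{G}$ for evaluation, where $\mathcal{G}_{test} \cap \mathcal{G}_{train} = \emptyset$.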

3.2 Subtask Graph Execution Problem

The subtask graph execution problem is a multitask RL problem with a specific form of task parameter called a subtask graph. Figure 1 illustrates an example subtask graph and environment. The task in our problem is to execute given subtasks in an optimal order to maximize reward within a time budget, where there are complex dependencies between subtasks defined by the subtask graph. We assume that the agent has learned a set of options (Precup, 2000; Stolle and Precup, 2002; Sutton et al., 1999) that perform subtasks by executing one or more primitive actions.

Subtask Graph and Environment

We define the terminologies as follows:

  • Precondition: A precondition of a subtask is defined as a logical expression of subtasks in sum-of-products (SoP) form where multiple AND terms are combined with an OR term (e.g., the precondition of subtask J in Figure 1 is OR(AND(D), AND(G))).

  • Eligibility vector: $\mathbf{e}_t = [e_t^1, \ldots, e_t^N]$, where $e_t^i = 1$ if subtask $i$ is eligible (i.e., the precondition of subtask $i$ is satisfied and it has never been executed by the agent) at time $t$, and $0$ otherwise.

  • Completion vector: $\mathbf{x}_t = [x_t^1, \ldots, x_t^N]$, where $x_t^i = 1$ if subtask $i$ has been executed by the agent while it is eligible, and $0$ otherwise.

  • Subtask reward vector: $\mathbf{r} = [r^1, \ldots, r^N]$ specifies the reward for executing each subtask.

  • Reward: $r_t = r^i$ if the agent executes subtask $i$ while it is eligible, and $r_t = 0$ otherwise.

  • Time budget: $\mathrm{step}_t \in \mathbb{R}$ is the number of remaining time steps until episode termination.

  • Observation: $\mathrm{obs}_t \in \mathbb{R}^{H \times W \times C}$ is a visual observation at time $t$ as illustrated in Figure 1.

To summarize, a subtask graph $G$ defines $N$ subtasks with corresponding rewards $\mathbf{r}$ and the preconditions. The state input at time $t$ consists of $s_t = \{\mathrm{obs}_t, \mathbf{x}_t, \mathbf{e}_t, \mathrm{step}_t\}$. The goal is to find a policy $\pi: (s_t, G) \mapsto o_t$ which maps the given context of the environment to an option ($o_t \in \mathcal{O}$).
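The definitions above can be made concrete with a minimal Python sketch. The class name `SubtaskGraph`, the helper names, and the reward values are illustrative assumptions, not part of the paper; only the J = OR(AND(D), AND(G)) precondition is taken from the Figure 1 example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A literal is (subtask_name, negated); an AND term is a list of literals;
# a precondition in sum-of-products form is a list of AND terms (their OR).
Literal = Tuple[str, bool]
Precondition = List[List[Literal]]


@dataclass
class SubtaskGraph:
    rewards: Dict[str, float]               # subtask reward vector
    preconditions: Dict[str, Precondition]  # missing entry = no precondition


def eligibility(graph: SubtaskGraph, completed: Dict[str, bool]) -> Dict[str, bool]:
    """Eligible = precondition satisfied and the subtask not yet executed."""
    result = {}
    for name in graph.rewards:
        terms = graph.preconditions.get(name, [])
        satisfied = (not terms) or any(
            all(completed[s] != negated for s, negated in term) for term in terms
        )
        result[name] = satisfied and not completed[name]
    return result


def execute(graph: SubtaskGraph, completed: Dict[str, bool], name: str) -> float:
    """Return the reward for executing `name` and update the completion vector."""
    if not eligibility(graph, completed)[name]:
        return 0.0  # executing an ineligible subtask gives no reward
    completed[name] = True
    return graph.rewards[name]


# Toy graph: J requires D OR G; the reward values are made up.
graph = SubtaskGraph(
    rewards={"D": 0.1, "G": 0.2, "J": 1.0},
    preconditions={"J": [[("D", False)], [("G", False)]]},
)
completed = {name: False for name in graph.rewards}
```

In this toy graph, J only becomes eligible once D or G has been executed, which is the kind of dependency the agent has to reason about.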


Our problem is challenging due to the following aspects:

  • Generalization: Only a subset of subtask graphs ($\mathcal{G}_{train}$) is available during training, but the agent is required to execute previously unseen and larger subtask graphs ($\mathcal{G}_{test}$).

  • Complex reasoning: The agent needs to infer the long-term effect of executing individual subtasks in terms of reward and cost (e.g., time) and find the optimal sequence of subtasks to execute without any explicit supervision or simulation-based search. We note that it may not be easy even for humans to find the solution without explicit search due to the exponentially large solution space.

  • Stochasticity: The outcome of subtask execution is stochastic in our setting (for example, some objects are randomly moving). Therefore, the agent needs to consider the expected outcome when deciding which subtask to execute.

4 Method

Figure 2: Neural subtask graph solver architecture. The task module encodes the subtask graph through a bottom-up and top-down process, and outputs the reward score $\mathbf{p}_t^{reward}$. The observation module encodes the observation using a CNN and outputs the cost score $\mathbf{p}_t^{cost}$. The final policy is a softmax policy over the sum of the two scores.

Our neural subtask graph solver (NSGS) is a neural network which consists of a task module and an observation module as shown in Figure 2. The task module encodes the precondition of each subtask via a bottom-up process and propagates the information about future subtasks and rewards to preceding subtasks (i.e., pre-conditions) via a top-down process. The observation module learns the correspondence between a subtask and its target object, and the relation between the locations of objects in the observation and the time cost. However, due to the aforementioned challenge (i.e., complex reasoning) in Section 3.2, learning to execute the subtask graph only from the reward is extremely challenging. To facilitate the learning, we propose the graph reward propagation policy (GRProp), a non-parametric policy that propagates the reward information between related subtasks to model their dependencies. Since our GRProp acts as a good initial policy, we train the NSGS to approximate the GRProp policy through policy distillation (Rusu et al., 2015; Parisotto et al., 2015), and finetune it through the actor-critic method with generalized advantage estimation (GAE) (Schulman et al., 2015) to maximize the reward. Section 4.1 describes the NSGS architecture, and Section 4.2 describes how to construct the GRProp policy.

4.1 Neural Subtask Graph Solver

Task Module

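Given a subtask graph $G$, the remaining time steps $\mathrm{step}_t \in \mathbb{R}$, an eligibility vector $\mathbf{e}_t$ and a completion vector $\mathbf{x}_t$, we compute a context embedding using a recursive-reverse-recursive neural network (R3NN) (Parisotto et al., 2016) as follows:

$$\phi^i_{bot,o} = b_{\theta_o}\Big(x_t^i, e_t^i, \mathrm{step}_t, \sum_{j \in Child_i} \phi^j_{bot,a}\Big), \qquad \phi^j_{bot,a} = b_{\theta_a}\Big(\sum_{k \in Child_j} \big[\phi^k_{bot,o}, w_+^{j,k}\big]\Big), \qquad (1)$$

$$\phi^i_{top,o} = t_{\theta_o}\Big(\phi^i_{bot,o}, r^i, \sum_{j \in Parent_i} \big[\phi^j_{top,a}, w_+^{j,i}\big]\Big), \qquad \phi^j_{top,a} = t_{\theta_a}\Big(\phi^j_{bot,a}, \sum_{k \in Parent_j} \phi^k_{top,o}\Big), \qquad (2)$$

where $[\cdot]$ is a concatenation operator, $b_\theta, t_\theta$ are the bottom-up and top-down encoding functions, $\phi^j_{bot,a}, \phi^j_{top,a}$ are the bottom-up and top-down embeddings of the $j$-th AND node respectively, and $\phi^i_{bot,o}, \phi^i_{top,o}$ are the bottom-up and top-down embeddings of the $i$-th OR node respectively (see Appendix for the detail). $w_+^{i,j}$, $Child_i$, and $Parent_i$ specify the connections in the subtask graph $G$. Specifically, $w_+^{i,j} = 1$ if the $j$-th OR node and the $i$-th AND node are connected without a NOT operation, $-1$ if there is a NOT connection, and $0$ if they are not connected; $Child_i$ and $Parent_i$ represent the sets of the $i$-th node's children and parents respectively. The embeddings are transformed to reward scores via $\mathbf{p}_t^{reward} = \Phi_{top}^{\top} \mathbf{v}$, where $\Phi_{top} = [\phi^1_{top,o}, \ldots, \phi^N_{top,o}] \in \mathbb{R}^{E \times N}$, $E$ is the dimension of the top-down embedding of an OR node, and $\mathbf{v} \in \mathbb{R}^{E}$ is a weight vector for reward scoring.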

Observation Module

The observation module encodes the input observation $\mathrm{obs}_t$ using a convolutional neural network (CNN) and outputs a cost score:

$$\mathbf{p}_t^{cost} = \mathrm{CNN}(\mathrm{obs}_t, \mathrm{step}_t), \qquad (3)$$

where $\mathrm{step}_t$ is the number of remaining time steps. An ideal observation module would learn to assign a high score to a subtask if its target object is close to the agent, because executing it would require less cost (i.e., time). Also, if the expected number of steps required to execute a subtask is larger than the number of remaining steps, an ideal agent would assign it a low score. The NSGS policy is a softmax policy:

$$\pi(o_t \mid s_t, G) = \mathrm{Softmax}\big(\mathbf{p}_t^{reward} + \mathbf{p}_t^{cost}\big), \qquad (4)$$

which adds reward scores and cost scores.
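As a rough illustration of how the two scores combine, the following Python sketch computes a softmax over their element-wise sum. The function name `nsgs_policy` and the score values are assumptions made for the example; in the actual agent the scores come from the task and observation modules.

```python
import numpy as np


def nsgs_policy(reward_scores: np.ndarray, cost_scores: np.ndarray) -> np.ndarray:
    """Softmax over the element-wise sum of the two per-subtask scores."""
    logits = reward_scores + cost_scores
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


# Made-up scores for three subtasks.
reward_scores = np.array([1.0, 0.5, -0.2])
cost_scores = np.array([-0.5, 0.4, 0.1])
probs = nsgs_policy(reward_scores, cost_scores)
```

Here the second subtask ends up most likely: its reward score is lower than the first subtask's, but its cost score (e.g., a nearby target object) more than compensates.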

4.2 Graph Reward Propagation Policy: Pre-training Neural Subtask Graph Solver

Intuitively, the graph reward propagation policy is designed to put high probabilities over subtasks that are likely to maximize the sum of modified and smoothed reward $\widetilde{U}_t$ at time $t$, which will be defined in Eq. 9. Let $\mathbf{x}_t$ be a completion vector and $\mathbf{r}$ be a subtask reward vector (see Section 3 for definitions). Then, the sum of reward until time step $t$ is given as:

$$U_t = \mathbf{r}^{\top} \mathbf{x}_t. \qquad (5)$$

We first modify the reward formulation such that it gives half of the subtask reward for satisfying the preconditions and the rest for executing the subtask, to encourage the agent to satisfy the precondition of a subtask with a large reward:

$$\widehat{U}_t = \mathbf{r}^{\top} (\mathbf{x}_t + \mathbf{e}_t) / 2. \qquad (6)$$
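Let $y^j_{AND}$ be the output of the $j$-th AND node. The eligibility vector $\mathbf{e}_t$ can be computed from the subtask graph $G$ and $\mathbf{x}_t$ as follows:

$$e_t^i = \underset{j \in Child_i}{\mathrm{OR}}\big(y^j_{AND}\big), \qquad y^j_{AND} = \underset{k \in Child_j}{\mathrm{AND}}\big(\widehat{x}_t^{j,k}\big), \qquad \widehat{x}_t^{j,k} = x_t^k w^{j,k} + (1 - x_t^k)(1 - w^{j,k}), \qquad (7)$$

where $w^{j,k} = 0$ if there is a NOT connection between the $j$-th node and the $k$-th node, otherwise $w^{j,k} = 1$. Intuitively, $\widehat{x}_t^{j,k} = 1$ when the $k$-th node does not violate the precondition of the $j$-th node. Note that $\widehat{U}_t$ is not differentiable with respect to $\mathbf{x}_t$ because AND and OR are not differentiable. To derive our graph reward propagation policy, we propose to substitute the AND and OR functions with “smoothed” functions $\widetilde{\mathrm{AND}}$ and $\widetilde{\mathrm{OR}}$ as follows:

$$\widetilde{e}_t^i = \underset{j \in Child_i}{\widetilde{\mathrm{OR}}}\big(\widetilde{y}^j_{AND}\big), \qquad \widetilde{y}^j_{AND} = \underset{k \in Child_j}{\widetilde{\mathrm{AND}}}\big(\widehat{x}_t^{j,k}\big), \qquad (8)$$

where $\widetilde{\mathrm{AND}}$ and $\widetilde{\mathrm{OR}}$ were implemented as scaled sigmoid and tanh functions as illustrated by Figure 3 (see Appendix for details). With the smoothed operations, the sum of smoothed and modified reward is given as:

$$\widetilde{U}_t = \mathbf{r}^{\top} (\mathbf{x}_t + \widetilde{\mathbf{e}}_t) / 2. \qquad (9)$$

Figure 3: Visualization of OR, $\widetilde{\mathrm{OR}}$, AND, and $\widetilde{\mathrm{AND}}$ operations with three inputs (a,b,c). These smoothed functions are defined to handle an arbitrary number of operands (see Appendix).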
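Finally, the graph reward propagation policy is a softmax policy,

$$\pi(o_t \mid G, \mathbf{x}_t) = \mathrm{Softmax}\big(\nabla_{\mathbf{x}_t} \widetilde{U}_t\big), \qquad (10)$$

that is, the softmax of the gradient of $\widetilde{U}_t$ with respect to $\mathbf{x}_t$.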
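The idea of propagating reward through a differentiable graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `soft_and` and `soft_or` are simple stand-ins (a shifted sigmoid and a tanh with arbitrary scales), not the smoothed functions defined in the paper's appendix, the gradient is estimated by central finite differences instead of automatic differentiation, and the toy graph and rewards are made up.

```python
import numpy as np


def soft_and(inputs: np.ndarray, scale: float = 4.0) -> float:
    """Illustrative smoothed AND: near 1 only when all inputs are near 1."""
    return float(1.0 / (1.0 + np.exp(-scale * (inputs.sum() - len(inputs) + 0.5))))


def soft_or(inputs: np.ndarray, scale: float = 2.0) -> float:
    """Illustrative smoothed OR: grows with the sum of its inputs."""
    return float(np.tanh(scale * inputs.sum()))


def smoothed_reward(x, r, and_nodes, or_children):
    """Half the reward for (smoothed) eligibility, half for completion."""
    y = []
    for children, weights in and_nodes:
        # x_hat is 1 when a child does not violate the AND node's precondition
        x_hat = x[children] * weights + (1 - x[children]) * (1 - weights)
        y.append(soft_and(x_hat))
    y = np.array(y)
    e = np.ones_like(x)  # subtasks without preconditions are always eligible
    for i, and_ids in or_children.items():
        e[i] = soft_or(y[and_ids])
    return float(r @ (x + e) / 2.0)


def grprop_policy(x, r, and_nodes, or_children, temperature=1.0, eps=1e-5):
    """Softmax of the numerically estimated gradient of the smoothed reward."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (smoothed_reward(x + step, r, and_nodes, or_children)
                   - smoothed_reward(x - step, r, and_nodes, or_children)) / (2 * eps)
    logits = temperature * grad
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


# Toy graph with made-up numbers: subtask 2 (reward 1.0) requires subtask 0;
# subtasks 0 and 1 have no precondition and give zero immediate reward.
x = np.zeros(3)
r = np.array([0.0, 0.0, 1.0])
and_nodes = [(np.array([0]), np.array([1.0]))]  # (children, w); w = 1 means no NOT
or_children = {2: np.array([0])}                # subtask 2 is an OR over AND node 0
probs = grprop_policy(x, r, and_nodes, or_children)
```

Although subtasks 0 and 1 both give zero immediate reward, subtask 0 receives a higher probability because reward from subtask 2 flows back to it through the smoothed precondition. The sketch does not mask ineligible subtasks; a complete agent would also need to restrict its choice to eligible ones.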

4.3 Policy Optimization

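The NSGS is first trained through policy distillation by minimizing the KL divergence between NSGS and teacher policy (GRProp) as follows:

$$\nabla_\theta \mathcal{L}_1 = \mathbb{E}_{G \sim \mathcal{G}_{train}}\Big[\mathbb{E}_{s \sim \pi_\theta^G}\Big[\nabla_\theta D_{KL}\big(\pi_T^G \,\|\, \pi_\theta^G\big)\Big]\Big], \qquad (11)$$

where $\theta$ is the parameter of NSGS, $\pi_\theta^G = \pi_\theta(\cdot \mid s, G)$ is the simplified notation of NSGS policy with subtask graph $G$, $\pi_T^G = \pi_T(\cdot \mid s, G)$ is the simplified notation of teacher (GRProp) policy with subtask graph $G$, $D_{KL}$ is KL divergence, and $\mathcal{G}_{train} \subset \mathcal{G}$ is the training set of subtask graphs. After policy distillation, we finetune the NSGS agent in an end-to-end manner using the actor-critic method with GAE (Schulman et al., 2015) as follows:

$$\nabla_\theta \mathcal{L}_2 = \mathbb{E}_{G \sim \mathcal{G}_{train}}\Big[\mathbb{E}_{s \sim \pi_\theta^G}\Big[-\nabla_\theta \log \pi_\theta^G \sum_{l=0}^{\infty}\Big(\prod_{n=0}^{l-1} (\gamma\lambda)^{k_{t+n}}\Big)\delta_{t+l}\Big]\Big], \qquad (12)$$

$$\delta_t = r_t + \gamma^{k_t} V^{\pi}_{\theta'}(s_{t+1}, G) - V^{\pi}_{\theta'}(s_t, G), \qquad (13)$$

where $k_t$ is the duration of option $o_t$, $\gamma$ is a discount factor, $\lambda \in [0, 1]$ is a weight for balancing between bias and variance of the advantage estimation, and $V^{\pi}_{\theta'}$ is the critic network parameterized by $\theta'$. During training, we update the critic network to minimize $\mathbb{E}\big[(R_t - V^{\pi}_{\theta'}(s_t, G))^2\big]$, where $R_t$ is the discounted cumulative reward at time $t$. The complete procedure for training our NSGS agent is summarized in Algorithm 1. We used $\eta_d$=1e-4, $\eta_c$=3e-6 for distillation and $\eta_{ac}$=1e-6, $\eta_c$=3e-7 for fine-tuning in the experiment.

Algorithm 1 Policy optimization
for iteration $n$ do
    Sample $G \sim \mathcal{G}_{train}$
    $D = \{(s_t, o_t, r_t, R_t, \mathrm{step}_t), \ldots\} \sim \pi_\theta^G$    ▷ do rollout
    $\theta' \leftarrow \theta' + \eta_c \sum_{D} \nabla_{\theta'} V^{\pi}_{\theta'}(s_t, G)\,\big(R_t - V^{\pi}_{\theta'}(s_t, G)\big)$    ▷ update critic
    if distillation then
        $\theta \leftarrow \theta - \eta_d \sum_{D} \nabla_\theta D_{KL}\big(\pi_T^G \,\|\, \pi_\theta^G\big)$    ▷ update policy
    else if fine-tuning then
        Compute $\delta_t$ from Eq. 13 for all $(s_t, o_t, r_t, R_t, \mathrm{step}_t) \in D$
        $\theta \leftarrow \theta + \eta_{ac} \sum_{D} \nabla_\theta \log \pi_\theta^G(o_t \mid s_t) \sum_{l=0}^{\infty}\Big(\prod_{n=0}^{l-1} (\gamma\lambda)^{k_{t+n}}\Big)\delta_{t+l}$    ▷ update policy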
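The advantage computation used during fine-tuning can be sketched as follows. This is a generic form of generalized advantage estimation adapted to options that last several primitive steps, written under the assumption that discounting is applied per primitive step (so an option of duration k is discounted by the k-th power); the function name `option_gae` and the default hyperparameters are illustrative and not taken from the authors' code.

```python
import numpy as np


def option_gae(rewards, values, durations, gamma=0.99, lam=0.96):
    """Generalized advantage estimates when option t lasts durations[t] steps.

    `values` has one more entry than `rewards`: the critic's estimate for the
    state reached after the last option (use 0.0 at episode termination).
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    # Work backwards so each advantage reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        discount = gamma ** durations[t]
        delta = rewards[t] + discount * values[t + 1] - values[t]  # TD error
        running = delta + (gamma * lam) ** durations[t] * running
        advantages[t] = running
    return advantages
```

With `gamma = lam = 1` and a zero critic, the estimate reduces to the undiscounted sum of future rewards, which is a convenient sanity check.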

5 Experiment

In the experiment, we investigated the following research questions: 1) Does GRProp outperform other heuristic baselines (e.g., greedy policy, etc.)? 2) Can NSGS deal with complex subtask dependencies, delayed reward, and the stochasticity of the environment? 3) Can NSGS generalize to unseen subtask graphs? 4) How does NSGS perform compared to MCTS? 5) Can NSGS be used to improve MCTS?

5.1 Environment

We evaluated the performance of our agents on two domains, Mining and Playground, which are developed based on MazeBase (Sukhbaatar et al., 2015). The code is available online. We used a pre-trained subtask executor for each domain. The episode length (time budget) was randomly set for each episode in a range such that the GRProp agent executes a target fraction of the subtasks on average. The subtasks in the higher layers of the subtask graph are designed to give larger rewards (see Appendix for details).

Mining domain is inspired by Minecraft (see Figures 1 and 5). The agent may pick up raw materials in the world and use them to craft different items at different craft stations. There are two forms of preconditions: 1) an item may be an ingredient for building other items (e.g., stick and stone are ingredients of the stone pickaxe), and 2) some tools are required to pick up some objects (e.g., the agent needs a stone pickaxe to mine iron ore). The agent can use an item multiple times after picking it up once. The set of subtasks and preconditions are hand-coded based on the crafting recipes in Minecraft, and used as a template to generate 640 random subtask graphs. We used 200 for training and 440 for testing.

Playground is a more flexible and challenging domain (see Figure 6). The subtask graphs in Playground were randomly generated, hence a precondition can be any logical expression and the reward may be delayed. Some of the objects move randomly, which makes the environment stochastic. The agent was trained on small subtask graphs and evaluated on much larger subtask graphs (see Table 1). The set of subtasks is $\mathcal{O} = \mathcal{A}_{int} \times \mathcal{X}$, where $\mathcal{A}_{int}$ is a set of primitive actions to interact with objects, and $\mathcal{X}$ is a set of all types of interactive objects in the domain. We randomly generated 500 graphs for training and 2,000 graphs for testing. Note that the task in the Playground domain subsumes many other hierarchical RL domains such as Taxi (Bloch, 2009), Minecraft (Oh et al., 2017) and XWORLD (Yu et al., 2017). In addition, we added the following components into subtask graphs to make the task more challenging:

  • Distractor subtask: A subtask with only NOT connection to parent nodes in the subtask graph. Executing this subtask may give an immediate reward, but it may make other subtasks ineligible.

  • Delayed reward: Agent receives no reward from subtasks in the lower layers, but it should execute some of them to make higher-level subtasks eligible (see Appendix for fully-delayed reward case).

5.2 Agents

                       Playground                      Mining
Task               D1      D2      D3      D4      Eval

Subtask Graph Setting
Depth              4       4       5       6       4-10
Subtask            13      15      16      16      10-26

Zero-Shot Performance
NSGS (Ours)        .820    .785    .715    .527    8.19
GRProp (Ours)      .721    .682    .623    .424    6.16
Greedy             .164    .144    .178    .228    3.39
Random             0       0       0       0       2.79

Adaptation Performance
NSGS (Ours)        .828    .797    .733    .552    8.58
Independent        .346    .296    .193    .188    3.89

Table 1: Generalization performance on unseen and larger subtask graphs. (Playground) The subtask graphs in D1 have the same graph structure as the training set, but the graphs were unseen. The subtask graphs in D2, D3, and D4 have (unseen) larger graph structures. (Mining) The subtask graphs in Eval are unseen during training. NSGS outperforms the other compared agents on all tasks and domains.

We evaluated the following policies:

  • Random policy executes any eligible subtask.

  • Greedy policy executes the eligible subtask with the largest reward.

  • Optimal policy is computed from exhaustive search on eligible subtasks.

  • GRProp (Ours) is graph reward propagation policy.

  • NSGS (Ours) is distilled from GRProp policy and finetuned with actor-critic.

  • Independent is an LSTM-based baseline trained on each subtask graph independently, similar to Independent model in Andreas et al. (2017). It takes the same set of input as NSGS except the subtask graph.
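The two heuristic baselines are simple enough to sketch directly in Python. The function names and the toy rewards are illustrative; choosing uniformly among eligible subtasks is an assumption, since the description above only says that the Random policy executes any eligible subtask.

```python
import random
from typing import Dict, Optional


def greedy_policy(rewards: Dict[str, float], eligible: Dict[str, bool]) -> Optional[str]:
    """Pick the eligible subtask with the largest immediate reward."""
    candidates = [name for name, ok in eligible.items() if ok]
    if not candidates:
        return None
    return max(candidates, key=lambda name: rewards[name])


def random_policy(eligible: Dict[str, bool], rng: random.Random) -> Optional[str]:
    """Pick any eligible subtask (uniformly at random, by assumption)."""
    candidates = [name for name, ok in eligible.items() if ok]
    return rng.choice(candidates) if candidates else None


# Made-up example: C has the largest reward but is not yet eligible.
rewards = {"A": 0.1, "B": 0.5, "C": 1.0}
eligible = {"A": True, "B": True, "C": False}
choice = greedy_policy(rewards, eligible)
```

Neither baseline looks at preconditions of future subtasks, so both ignore that executing a low-reward subtask might unlock C.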

To the best of our knowledge, existing work on hierarchical RL cannot directly address our problem with a subtask graph input. Instead, we evaluated an instance of a hierarchical RL method (the Independent agent) in the adaptation setting, as discussed in Section 5.3.

5.3 Quantitative Result

Figure 4: Learning curves on Mining and Playground domain. NSGS is distilled from GRProp on 77K and 256K episodes, respectively, and finetuned after that.

Training Performance

The learning curves of NSGS and performance of other agents are shown in Figure 4. Our GRProp policy significantly outperforms the Greedy policy. This implies that the proposed idea of back-propagating the reward gradient captures long-term dependencies among subtasks to some extent. We also found that NSGS further improves the performance through fine-tuning with actor-critic method. We hypothesize that NSGS learned to estimate the expected costs of executing subtasks from the observations and consider them along with subtask graphs.

Figure 5: Example trajectories of Greedy, GRProp, and NSGS agents given 75 steps on Mining domain. We used different colors to indicate that the agent has different types of pickaxes: red (no pickaxe), blue (stone pickaxe), and green (iron pickaxe). The Greedy agent prefers subtasks C, D, F, and G to H and L since C, D, F, and G give positive immediate reward, whereas the NSGS and GRProp agents find a short path to make a stone pickaxe, focusing on subtasks with higher long-term reward. Compared to GRProp, the NSGS agent finds a shorter path to make an iron pickaxe, and succeeds in executing more subtasks.

Figure 6: Example trajectories of Greedy, GRProp, and NSGS agents given 45 steps on Playground domain. The subtask graph includes NOT operations and distractors (subtasks D, E, and H). We removed stochasticity in the environment for the controlled experiment. The Greedy agent executes the distractors since they give positive immediate rewards, which makes it impossible to execute subtask K, which gives the largest reward. The GRProp and NSGS agents avoid distractors and successfully execute subtask K by satisfying its preconditions. After executing subtask K, the NSGS agent finds a shorter path to execute the remaining subtasks than the GRProp agent and gets a larger reward.

Figure 7: Performance of MCTS+NSGS, MCTS+GRProp and MCTS per the number of simulated steps on (Left) Eval of Mining domain and (Right) D2 of Playground domain (see Table 1).

Generalization Performance

We considered two different types of generalization: a zero-shot setting, where the agent must immediately achieve good performance on unseen subtask graphs without learning, and an adaptation setting, where the agent can learn about the task through interaction with the environment. Note that the Independent agent was evaluated in the adaptation setting only, since it has no ability to generalize as it does not take the subtask graph as input. In particular, we tested agents on larger subtask graphs by varying the number of layers of the subtask graphs from four to six with a larger number of subtasks on Playground domain. Table 1 summarizes the results in terms of normalized reward $\bar{R} = (R - R_{min}) / (R_{max} - R_{min})$, where $R_{min}$ and $R_{max}$ correspond to the average reward of the Random and the Optimal policy respectively. Due to the large number of subtasks (16) in Mining domain, the Optimal policy was intractable to evaluate. Instead, we report the un-normalized mean reward. Though the performance degrades as the subtask graph becomes larger, as expected, NSGS generalizes well to larger subtask graphs and consistently outperforms all the other agents on Playground and Mining domains in the zero-shot setting. In the adaptation setting, NSGS performs slightly better than in the zero-shot setting by fine-tuning on the subtask graphs in the evaluation set. The Independent agent learned a policy comparable to Greedy, but performs much worse than NSGS.
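The normalization described above amounts to a linear rescaling between the Random and Optimal baselines. A minimal Python sketch, with a hypothetical function name and made-up reward values, is:

```python
def normalized_reward(mean_reward: float, random_reward: float, optimal_reward: float) -> float:
    """Rescale so the Random policy maps to 0 and the Optimal policy to 1."""
    return (mean_reward - random_reward) / (optimal_reward - random_reward)


# Made-up averages: Random earns 2.0, Optimal earns 6.0, the agent earns 4.0.
score = normalized_reward(4.0, random_reward=2.0, optimal_reward=6.0)
```

This is why the Random row reads 0 on Playground, and why no normalized value can be reported when the Optimal policy cannot be computed.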

5.4 Qualitative Result

Figure 5 visualizes trajectories of agents on Mining domain. The Greedy policy mostly focuses on subtasks with immediate rewards (e.g., get string, make bow) that are sub-optimal in the long run. In contrast, the NSGS and GRProp agents focus on executing subtask H (make stone pickaxe) in order to collect materials much faster in the long run. Compared to GRProp, NSGS also learns to consider the observation and avoids subtasks with high cost (e.g., get coal).

Figure 6 visualizes trajectories on Playground domain. In this graph, there are distractors (e.g., D, E, and H) and the reward is delayed. In the beginning, Greedy chooses to execute distractors, since they give positive reward while subtasks A, B, and C do not. However, GRProp observes non-zero gradients for subtasks A, B, and C that are propagated from the parent nodes. Thus, even though the reward is delayed, GRProp can figure out which subtask to execute. NSGS learns to understand long-term dependencies from GRProp, and finds a shorter path by also considering the observation.

5.5 Combining NSGS with Monte-Carlo Tree Search

We further investigated how well our NSGS agent performs compared to conventional search-based methods and how our NSGS agent can be combined with search-based methods to further improve the performance. We implemented the following methods (see Appendix for the detail):

  • MCTS: An MCTS algorithm with UCB (Auer et al., 2002) criterion for choosing actions.

  • MCTS+NSGS: An MCTS algorithm combined with our NSGS agent. NSGS policy was used as a rollout policy to explore reasonably good states during tree search, which is similar to AlphaGo (Silver et al., 2016).

  • MCTS+GRProp: An MCTS algorithm combined with our GRProp agent similar to MCTS+NSGS.

The results are shown in Figure 7. It turns out that our NSGS performs as well as the MCTS method with approximately 32K simulations on Playground and 11K simulations on Mining domain, while GRProp performs as well as MCTS with approximately 11K simulations on Playground and 1K simulations on Mining domain. This indicates that our NSGS agent implicitly performs long-term reasoning that is not easily achievable by a sophisticated MCTS, even though NSGS does not use any simulation and has never seen such subtask graphs during training. More interestingly, MCTS+NSGS and MCTS+GRProp significantly outperform MCTS, and MCTS+NSGS achieves a normalized reward close to that of the Optimal policy with 33K simulations on Playground domain. We found that the Optimal policy, which by definition corresponds to a normalized reward of 1, uses approximately 648M simulations on Playground domain. Thus, MCTS+NSGS performs almost as well as the Optimal policy with only about 0.005% of the simulations (33K versus 648M). This result implies that NSGS can also be used to improve simulation-based planning methods by effectively reducing the search space.
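For readers unfamiliar with the UCB criterion used for action selection in the tree, the following Python sketch shows the standard UCB1 rule. It is a generic illustration, not the authors' implementation: the function name, the exploration constant, and the example statistics are assumptions.

```python
import math
from typing import Dict


def ucb1_select(visits: Dict[str, int], total_value: Dict[str, float],
                c: float = math.sqrt(2)) -> str:
    """Return the action maximizing mean value plus the UCB1 exploration bonus."""
    total = sum(visits.values())
    best_action, best_score = None, -math.inf
    for action, n in visits.items():
        if n == 0:
            return action  # try every action once before comparing scores
        score = total_value[action] / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


# Made-up statistics: "b" has a lower mean value but has barely been explored.
visits = {"a": 100, "b": 1}
total_value = {"a": 60.0, "b": 0.5}
chosen = ucb1_select(visits, total_value)
```

In MCTS+NSGS the selection rule stays the same; what changes is that rollouts from newly expanded nodes follow the learned policy, so the value estimates feeding this rule become informative with fewer simulations.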

6 Conclusion

We introduced the subtask graph execution problem, which is an effective and principled framework for describing complex tasks. To address the difficulty of dealing with complex subtask dependencies, we proposed a graph reward propagation policy derived from a differentiable form of the subtask graph, which plays an important role in pre-training our neural subtask graph solver architecture. The empirical results showed that our agent can deal with long-term dependencies between subtasks and generalize well to unseen subtask graphs. In addition, we showed that our agent can be used to effectively reduce the search space of MCTS so that the agent can find a near-optimal solution with a small number of simulations. In this paper, we assumed that the subtask graph (e.g., subtask dependencies and rewards) is given to the agent. However, it will be very interesting future work to investigate how to extend to more challenging scenarios where the subtask graph is unknown (or partially known) and thus needs to be estimated through experience.


This work was supported mainly by the ICT R&D program of MSIP/IITP (2016-0-00563: Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion) and partially by DARPA Explainable AI (XAI) program #313498 and Sloan Research Fellowship.


Appendix A Details of the Task

We define each task as an MDP tuple where is a set of states, is a set of actions, is a task-specific state transition function, is a task-specific reward function and is a task-specific initial distribution over states. We describe the subtask graph and each component of MDP in the following paragraphs.

Subtask and Subtask Graph

The subtask graph consists of subtasks, which form a subset of , the subtask reward , and the precondition of each subtask. The set of subtasks is , where is a set of primitive actions to interact with objects, and is a set of all types of interactive objects in the domain. To execute a subtask , the agent should move onto the target object and take the primitive action .
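As a minimal sketch of this construction, the snippet below builds the full set of candidate subtasks as the Cartesian product of primitive actions and object types, and checks that a graph's subtasks form a subset of it. The particular action and object names are borrowed from the Playground description for illustration only; the variable names are hypothetical.

```python
from itertools import product

# Illustrative primitive actions and object types (names chosen for the example).
primitive_actions = ["pickup", "transform"]
object_types = ["cow", "duck", "egg"]

# Every (primitive action, object type) pair is a candidate subtask.
all_subtasks = list(product(primitive_actions, object_types))

# A particular subtask graph only uses a subset of the candidate subtasks.
graph_subtasks = [("pickup", "egg"), ("transform", "cow")]
is_valid_graph = set(graph_subtasks) <= set(all_subtasks)
```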


The state consists of the observation , the completion vector , the time budget and the eligibility vector . An observation is represented as a tensor, where and are the height and width of the map respectively, and is the number of object types in the domain. The -th element of the observation tensor is if there is an object in on the map, and otherwise. The time budget indicates the number of remaining time-steps until the episode termination. The completion vector and eligibility vector provide additional information about subtasks. The details of the completion vector and eligibility vector will be explained in the following paragraph.
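The observation encoding described above can be sketched as a binary tensor with one channel per object type. This is only an illustration: the axis order (row, column, object type), the function name `encode_observation`, and the `placements` dictionary are assumptions made for the example, not details taken from the paper.

```python
def encode_observation(height, width, object_types, placements):
    """Build a height x width x num_types nested list of 0/1 values.

    placements maps a (row, col) cell to the name of the object located there
    (hypothetical input format; one object per cell for simplicity).
    """
    type_index = {name: k for k, name in enumerate(object_types)}
    obs = [[[0] * len(object_types) for _ in range(width)] for _ in range(height)]
    for (row, col), name in placements.items():
        # Mark the channel of this object type at the object's location.
        obs[row][col][type_index[name]] = 1
    return obs


obs = encode_observation(2, 3, ["cow", "egg"], {(0, 1): "egg", (1, 2): "cow"})
```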

State Distribution and Transition Function

Given the current state , the next step state is computed from the subtask graph . At the beginning of an episode, the initial time budget is sampled from a pre-specified range for each subtask graph (see Section J for details), the completion vector is initialized to a zero vector, and the observation is sampled from the task-specific initial state distribution . Specifically, the observation is generated by randomly placing the agent and the objects corresponding to the subtasks defined in the subtask graph . When the agent executes subtask , the -th element of the completion vector is updated by the following update rule:


The observation is updated such that the agent moves onto the target object and performs the corresponding primitive action (see Section I for the full list of subtasks and corresponding primitive actions in the Mining and Playground domains). The eligibility vector is computed from the completion vector and subtask graph as follows:


where if there is a NOT connection between -th node and -th node, otherwise . Intuitively, when -th node does not violate the precondition of -th node. Executing each subtask costs a different amount of time depending on the map configuration. Specifically, the time cost is given as the Manhattan distance between the agent location and the target object location in the grid-world plus one more step for performing a primitive action.
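The two computations in this paragraph can be illustrated with a short sketch. It assumes that each precondition is an OR over AND clauses and that a NOT connection is satisfied when the child subtask has not been completed; the data layout (`preconditions` as lists of `(child_index, negated)` pairs) and the function names are hypothetical. The time cost follows the Manhattan-distance-plus-one rule stated above.

```python
def eligibility(completion, preconditions):
    """Compute a 0/1 eligibility flag for every subtask.

    completion: list of 0/1 completion flags, one per subtask.
    preconditions: for each subtask, a list of AND clauses; each clause is a
    list of (child_index, negated) pairs. An empty list means the subtask
    has no precondition (assumed always eligible in this sketch).
    """
    flags = []
    for clauses in preconditions:
        if not clauses:
            flags.append(1)
            continue
        # OR over clauses, AND within a clause; a negated (NOT) edge holds
        # only while the child subtask is still incomplete.
        satisfied = any(
            all((completion[k] == 0) if negated else (completion[k] == 1)
                for k, negated in clause)
            for clause in clauses
        )
        flags.append(int(satisfied))
    return flags


def time_cost(agent_pos, target_pos):
    """Manhattan distance to the target plus one step for the primitive action."""
    return abs(agent_pos[0] - target_pos[0]) + abs(agent_pos[1] - target_pos[1]) + 1


# Subtask 2 requires subtask 0 to be done AND subtask 1 NOT to be done.
example_preconditions = [[], [], [[(0, False), (1, True)]]]
```

In this example, completing subtask 1 before subtask 2 makes subtask 2 permanently ineligible, which is the kind of situation the NOT connection is meant to capture.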

Task-specific Reward Function

The reward function is defined in terms of the subtask reward vector and the eligibility vector , where the subtask reward vector is a component of the subtask graph and the eligibility vector is computed from the completion vector and subtask graph as in Eq. 17. Specifically, when the agent executes subtask , the reward given to the agent at time step is given as follows:


Appendix B Experiment on Hierarchical Task Network

We compared our methods with the recent graph-based multitask RL works Hayes and Scassellati [2016], Ghazanfari and Taylor [2017], Huang et al. [2018]. However, these methods cannot be applied to our problem for two main reasons: 1) they aim to solve a single-goal task, which means they can only solve a subset of our problem, and 2) they require search or learning during test time, which means they cannot be applied in the zero-shot generalization setting. Specifically, each trajectory in a single-goal task is assumed to be labeled as success or failure depending on whether the goal was achieved or not, which is necessary for these methods Hayes and Scassellati [2016], Ghazanfari and Taylor [2017], Huang et al. [2018] to infer the task structure (e.g., hierarchical task network (HTN) [Sacerdoti, 1975]). Since our task setting is more general and not limited to a single-goal task, the task structure with multiple goals cannot be inferred with these methods.

For a direct comparison, we simplified our problem into a single-goal task as follows. 1) We set a single goal; we set all the subtask rewards to 0, except the top-level subtask, and set it as the terminal state. 2) We removed the cost, time budget, and observation, and set . After constructing the task network such as HTN, these methods Hayes and Scassellati [2016], Ghazanfari and Taylor [2017], Huang et al. [2018] execute the task by planning Hayes and Scassellati [2016] or learning a policy Ghazanfari and Taylor [2017], Huang et al. [2018] during the test stage. Accordingly, we evaluated the HTN-plan method Hayes and Scassellati [2016] in the planning setting, and allowed learning at test time for Ghazanfari and Taylor [2017], Huang et al. [2018]. Note that these methods cannot execute a task in the zero-shot setting, while our NSGS can do so by learning an embedding of the subtask graph; this is the main reason why our method performs much better than these methods in the following two experiments.

Figure 8: Planning performance of MCTS+NSGS, MCTS+GRProp and HTN-Plan on HTN subtask graph in Playground domain.

Method Adaptation (HTN)
NSGS (Ours) .90
HTN-Independent .31
Table 2: Adaptation performance (normalized reward) of NSGS and HTN-Independent on HTN subtask graph in Playground domain.

B.1 Comparison with HTN-Planning

Hayes and Scassellati [2016] performed planning on the inferred task network to find the optimal solution. Thus, we implemented HTN-Plan with MCTS as in Section 5.5, and compared it with our methods in the planning setting. We evaluated our MCTS+NSGS and MCTS+GRProp for comparison. Figure 8 shows that our MCTS+NSGS and MCTS+GRProp agents outperform HTN-Plan by a large margin.

B.2 Comparison with HTN-based Agent

Instead of planning, Ghazanfari and Taylor [2017] learned a hierarchical RL (HRL) agent on the constructed HTN during testing. Thus, we evaluated it in the adaptation setting (i.e., learning during test time). To this end, we implemented an HRL agent, HTN-Independent, which is a policy over options trained on each subtask graph independently, similar to the Independent agent (see Section 5.2). The result in Table 2 shows that our NSGS agent can find the solution much faster than the HTN-Independent agent due to its zero-shot generalization ability.

Huang et al. [2018] inferred the subtask graph from a visual demonstration at test time. Since the environment state is available in our setting, providing a demonstration amounts to providing the solution. Thus, we could not compare with it.

Appendix C Details of NSGS Architecture

Figure 9: An example of R3NN construction for a given subtask graph input. The four encoders ( and ) are cloned and connected according to the input subtask graph where the cloned models share the weight. For simplicity, only the output embeddings of bottom-up and top-down OR encoder were specified in the figure.

Task module

Figure 9 illustrates the structure of the task module of NSGS architecture for a given input subtask graph. Specifically, the task module was implemented with four encoders: and . The input and output of each encoder is defined in the main text section 4.1 as:


For the bottom-up process, the encoder takes the output embeddings of its children encoders as input. Similarly, for the top-down process, the encoder takes the output embeddings of its parent encoders as input. The input embeddings are aggregated by taking element-wise summation. For and , the embeddings are concatenated with to deal with NOT connection before taking the element-wise summation. Then, the summed embedding is concatenated with all additional input as defined in Eq. 19 and 20, which is further transformed with three fully-connected layers with 128 units. The last fully-connected layer outputs a 128-dimensional output embedding. The embeddings are transformed to reward scores as via: where , is the dimension of the top-down embedding of OR node, and is a weight vector for reward scoring. Similarly, the reward baseline is computed by , where sum() is the reduced-sum operation and is the weight vector for reward baseline. We used the parametric ReLU (PReLU) function as the activation function.

Observation module

The network consists of BN1-Conv1(16x1x1-1/0)-BN2-Conv2(32x3x3-1/1)-BN3-Conv3(64x3x3-1/1)-BN4-Conv4(96x3x3-1/1)-BN5-Conv5(128x3x3-1/1)-BN6-Conv6(64x1x1-1/0)-FC(256). The output embedding of FC(256) was then concatenated with the number of remaining time steps . Finally, the network has two fully-connected output layers for the cost score and the cost baseline . Then, the policy of NSGS is calculated by adding the reward score and the cost score, and taking the softmax:


The baseline output is obtained by adding reward baseline and cost baseline:
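A minimal Python sketch of both combinations (the softmax policy and the summed baseline) is given here under stated assumptions: the scores are plain per-subtask lists, and the function names `softmax` and `nsgs_outputs` are hypothetical rather than part of the original implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nsgs_outputs(reward_scores, cost_scores, reward_baseline, cost_baseline):
    """Combine the task-module and observation-module outputs.

    The policy is the softmax of the element-wise sum of reward and cost
    scores; the baseline is the sum of the two scalar baselines.
    """
    logits = [r + c for r, c in zip(reward_scores, cost_scores)]
    return softmax(logits), reward_baseline + cost_baseline


policy, baseline = nsgs_outputs([1.0, 2.0, 0.5], [0.0, -1.0, 0.5], 0.3, 0.2)
```

Because the two modules only interact through this sum, a subtask with a high reward score can still be de-prioritized when its cost score (for example, reflecting a distant target object) is low.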


Appendix D Details of Learning NSGS Agent

Learning objectives

The NSGS architecture is first trained through policy distillation and finetuned using actor-critic method with generalized advantage estimator. During policy distillation, the KL divergence between NSGS and teacher policy (GRProp) is minimized as follows:


where is the parameter of NSGS architecture, is the simplified notation of NSGS policy with subtask graph input , is the simplified notation of teacher (GRProp) policy with subtask graph input , and is the training set of subtask graphs.
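To make the distillation objective concrete, the sketch below computes the KL divergence between a teacher and a student distribution over subtasks and averages it over a batch of states. The direction of the divergence (teacher relative to student) and the helper names are assumptions for illustration; the original equation should be consulted for the exact form.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_policies, student_policies):
    """Average KL(teacher || student) over a batch of visited states."""
    pairs = list(zip(teacher_policies, student_policies))
    return sum(kl_divergence(t, s) for t, s in pairs) / len(pairs)


teacher = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
student = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
loss = distillation_loss(teacher, student)
```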

For both policy distillation and fine-tuning, we sampled one subtask graph for each of 16 parallel workers, and each worker in turn sampled a mini-batch of 16 world configurations (maps). Thus, NSGS generates a total of 256 episodes in parallel. After the episodes are generated, the gradients from the 256 episodes are collected and averaged, and then back-propagated to update the parameters. For policy distillation, we trained NSGS for 40 epochs, where each epoch involves 100 updates. Since our GRProp policy observes only the subtask graph, we only trained the task module during policy distillation. The observation module was trained on an auxiliary prediction task; the observation module predicts the number of steps taken by the agent to execute each subtask.

After policy distillation, we finetune NSGS agent in an end-to-end manner using actor-critic method with generalized advantage estimation (GAE) [Schulman et al., 2015] as follows:


where is the duration of option , is a discount factor, is a weight for balancing between bias and variance of the advantage estimation, and is the critic network parameterized by . During training, we update the critic network to minimize , where is the discounted cumulative reward at time .
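The advantage estimate can be sketched as a backward recursion. The version below is an assumption-laden illustration rather than the paper's exact estimator: it treats each decision as an option with its own duration and discounts the next value by the discount factor raised to that duration. The function name and argument layout are hypothetical, and the discount and weighting values must be supplied by the caller.

```python
def gae_advantages(rewards, values, durations, gamma, lam):
    """Generalized advantage estimates for a sequence of options.

    rewards[t]:   reward collected while executing option t
    values[t]:    critic estimate before option t (len(rewards) + 1 entries,
                  the last one being the bootstrap value)
    durations[t]: number of primitive steps taken by option t
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        discount = gamma ** durations[t]
        # One-step temporal-difference error for option t.
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + discount * lam * running
        advantages[t] = running
    return advantages
```

Setting every duration to 1 recovers the usual per-step recursion of generalized advantage estimation.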


For both finetuning and policy distillation, we used RMSProp optimizer with the smoothing parameter of 0.97 and epsilon of 1e-6. When distilling agent with teacher policy, we used learning rate=1e-4 and multiplied it by 0.97 on every epoch for both Mining and Playground domain. For finetuning, we used learning rate=2.5e-6 for Playground domain, and 2e-7 for Mining domain. For actor-critic training for NSGS, we used


Appendix E Details of AND/OR Operation and Approximated AND/OR Operation

In Section 4.2, the outputs of the -th AND and OR nodes in the subtask graph were defined using AND and OR operations with multiple inputs. They can be represented as logical expressions as below:


where are the elements of a set and is the set of inputs coming from the children nodes of -th node. Then, these AND and OR operations are smoothed as below:


where , , is the sigmoid function, and are hyperparameters to be set. We used for the Mining domain, and for the Playground domain.

Appendix F Details of Subtask Executor


The subtask executor has the same architecture as the parameterized skill architecture of Oh et al. [2017] with slightly different hyperparameters. The network consists of Conv1(32x3x3-1/1)-Conv2(32x3x3-1/1)-Conv3(32x1x1-1/0)-Conv4(32x3x3-1/1)-LSTM(256)-FC(256). The subtask executor takes two task parameters as additional input and computes to obtain the subtask embedding, which is further linearly transformed into the weights of Conv3 and the (factorized) weight of LSTM through multiplicative interaction as described above. Finally, the network has three fully-connected output layers for actions, termination probability, and baseline, respectively.

Learning objective

The subtask executor is trained through policy distillation and then finetuned. Similar to [Oh et al., 2017], we first trained 16 teacher policy networks for each subtask. The teacher policy network consists of Conv1(16x3x3-1/1)-BN1(16)-Conv2(16x3x3-1/1)-BN2(16)-Conv3(16x3x3-1/1)-BN3(16)-LSTM(128)-FC(128). Similar to the subtask executor network, the teacher policy network has three fully-connected output layers for actions, termination probability, and baseline, respectively. Then, the learned teacher policy networks are used as teacher policies for policy distillation to train the subtask executor. During policy distillation, we train the agent to minimize the following objective function:


where is the parameter of the subtask executor network, is the simplified notation of the subtask executor given input subtask , is the simplified notation of the teacher policy for subtask , is the cross entropy loss of predicting termination, is a set of states in which the subtask is terminated, is the termination probability output, and . After policy distillation, we finetuned the subtask executor using the actor-critic method with generalized advantage estimation (GAE):


where is a discount factor, is a weight for balancing between bias and variance of the advantage estimation, and . We used for fine-tuning, and for both policy distillation and fine-tuning.

Appendix G Details of LSTM Baseline


The LSTM baseline consists of an LSTM on top of a CNN. The architecture of the CNN is the same as the CNN architecture of the observation module of NSGS described in Section C, and the architecture of the LSTM is the same as the LSTM architecture used in the subtask executor described in Section F. Specifically, it consists of BN1-Conv1(16x1x1-1/0)-BN2-Conv2(32x3x3-1/1)-BN3-Conv3(64x3x3-1/1)-BN4-Conv4(96x3x3-1/1)-BN5-Conv5(128x3x3-1/1)-BN6-Conv6(64x1x1-1/0)-LSTM(256)-FC(256). The CNN takes the observation tensor as an input and outputs an embedding. The embedding is then concatenated with other input vectors including the subtask completion indicator , the eligibility vector , and the remaining step . Finally, the LSTM takes the concatenated vector as an input and outputs the softmax policy with the parameter : .

Learning objective

The LSTM baseline was trained using actor-critic method. For the baseline, we found that the moving average of return works much better than learning a critic network, and used it for experiment. This is due to the characteristic of adaptation setting; in adaptation setting, the subtask graph is fixed and the agent is trained for only a small number of episodes such that the critic network is usually under-fitted. Similar to NSGS, the learning objective is given as


where is a discount factor, is a weight for balancing between bias and variance of the advantage estimation, , and is the moving average of return at time step . We used and .
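The moving-average baseline mentioned above can be sketched as follows. The exponential form of the average, the per-time-step bookkeeping, and the class name `MovingAverageBaseline` are assumptions made for this illustration; the text only states that a moving average of the return replaces a learned critic.

```python
class MovingAverageBaseline:
    """Tracks a moving average of the return for each time step."""

    def __init__(self, decay):
        self.decay = decay      # weight on the previous average (assumed form)
        self.values = {}        # time step -> current average return

    def advantage(self, step, ret):
        """Return minus the current baseline (0 if the step is unseen)."""
        return ret - self.values.get(step, 0.0)

    def update(self, step, ret):
        """Fold a newly observed return into the average for this step."""
        old = self.values.get(step)
        if old is None:
            self.values[step] = ret
        else:
            self.values[step] = self.decay * old + (1.0 - self.decay) * ret
        return self.values[step]


baseline = MovingAverageBaseline(decay=0.5)
first = baseline.update(0, 2.0)
second = baseline.update(0, 4.0)
```

A baseline of this kind has no parameters to fit, which is consistent with the observation above that a critic network tends to be under-fitted when only a few episodes are available.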

Appendix H Details of Search Algorithms

Each iteration of Monte-Carlo tree search method consists of four stages: selection, expansion, rollout, and back-propagation.

  • Selection: We used UCB criterion Auer et al. [2002]. Specifically, the option for which the score below has the highest value is chosen for selection:


    where is the accumulated return at -th node, is the number of visits of -th node, is the exploration-exploitation balancing weight, and is the number of total iterations so far. We found that gives the best result and used it for the MCTS, MCTS+GRProp and MCTS+NSGS methods.

  • Expansion: MCTS randomly chooses among the remaining eligible subtasks, while the subtask is chosen by the NSGS policy for the MCTS+NSGS method and by the GRProp policy for the MCTS+GRProp method. More specifically, MCTS+NSGS and MCTS+GRProp greedily choose among the remaining subtasks based on the NSGS and GRProp policies, respectively. Due to the memory limit, the expansion of the search tree was truncated at a depth of 7 for the Playground domain and 10 for the Mining domain, and rollout was performed beyond the maximum depth.

  • Rollout: MCTS randomly executes an eligible subtask, while MCTS+NSGS and MCTS+GRProp execute the subtask with the highest probability given by NSGS and GRProp policies, respectively.

  • Back-propagation: Once the episode is terminated, the result is back-propagated; the accumulated return and the visit count are updated for the nodes in the tree that the agent visited within the episode, and the number of total iterations is updated as .
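The selection stage described in the list above can be sketched as follows, assuming the standard UCB1 form (mean return plus an exploration bonus). The exact constant inside the bonus, the treatment of unvisited nodes, and the function names are assumptions for illustration and may differ from the original implementation.

```python
import math


def ucb_score(accumulated_return, visits, total_iterations, c):
    """Mean return plus an exploration bonus that shrinks with more visits."""
    if visits == 0:
        return float("inf")  # assumed convention: try unvisited nodes first
    mean_return = accumulated_return / visits
    return mean_return + c * math.sqrt(math.log(total_iterations) / visits)


def select_child(children, total_iterations, c):
    """children maps an option name to (accumulated_return, visits)."""
    return max(
        children,
        key=lambda name: ucb_score(children[name][0], children[name][1],
                                   total_iterations, c),
    )


stats = {"A": (5.0, 10), "B": (3.0, 10)}
greedy_choice = select_child(stats, total_iterations=20, c=0.0)
```

With the balancing weight set to zero the rule reduces to picking the option with the highest mean return; larger weights favor rarely visited options.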

Appendix I Details of Environment

I.1 Mining

There are 15 types of objects: Mountain, Water, Work space, Furnace, Tree, Stone, Grass, Pig, Coal, Iron, Silver, Gold, Diamond, Jeweler’s shop, and Lumber shop. The agent can take 10 primitive actions: up, down, left, right, pickup, use1, use2, use3, use4, and use5. The agent cannot move onto Mountain and Water cells. Pickup removes the object under the agent, and the use actions do not change the observation. There are 26 subtasks in the Mining domain:

  • Get wood/stone/string/pork/coal/iron/silver/gold/diamond: The agent should go to Tree/Stone/Grass/Pig/Coal/Iron/Silver/Gold/Diamond respectively, and take pickup action.

  • Make firewood/stick/arrow/bow: The agent should go to Lumber shop and take use1/use2/use3/use4 action respectively.

  • Light furnace: The agent should go to Furnace and take use1 action.

  • Smelt iron/silver/gold: The agent should go to Furnace and take use2/use3/use4 action respectively.

  • Make stone-pickaxe/iron-pickaxe/silverware/goldware/bracelet: The agent should go to Work space and take use1/use2/use3/use4/use5 action respectively.

  • Make earrings/ring/necklace: The agent should go to Jeweler’s shop and take use1/use2/use3 action respectively.
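The mapping in the list above can be written down as a small lookup table from subtask name to (target object, primitive action). This is only a sketch built from the entries spelled out in the bullets, which enumerate 25 name/object/action combinations; the function name `build_mining_subtasks` and the dictionary layout are hypothetical.

```python
def build_mining_subtasks():
    """Return {subtask name: (target object, primitive action)}."""
    subtasks = {}

    # "Get X" subtasks: go to the source object and take the pickup action.
    items = ["wood", "stone", "string", "pork", "coal",
             "iron", "silver", "gold", "diamond"]
    sources = ["Tree", "Stone", "Grass", "Pig", "Coal",
               "Iron", "Silver", "Gold", "Diamond"]
    for item, source in zip(items, sources):
        subtasks["Get " + item] = (source, "pickup")

    # Crafting subtasks: (names, target object, index of the first use action).
    groups = [
        (["Make firewood", "Make stick", "Make arrow", "Make bow"], "Lumber shop", 1),
        (["Light furnace"], "Furnace", 1),
        (["Smelt iron", "Smelt silver", "Smelt gold"], "Furnace", 2),
        (["Make stone-pickaxe", "Make iron-pickaxe", "Make silverware",
          "Make goldware", "Make bracelet"], "Work space", 1),
        (["Make earrings", "Make ring", "Make necklace"], "Jeweler's shop", 1),
    ]
    for names, target, first_use in groups:
        for offset, name in enumerate(names):
            subtasks[name] = (target, "use" + str(first_use + offset))
    return subtasks


mining_subtasks = build_mining_subtasks()
```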

The icons used in Mining domain were downloaded from and The Diamond and Furnace icons were made by Freepik from

I.2 Playground

There are 10 types of objects: Cow, Milk, Duck, Egg, Diamond, Heart, Box, Meat, Block, and Ice. The Cow and Duck move by 1 pixel in a random direction with probabilities of 0.1 and 0.2, respectively. The agent can take 6 primitive actions: up, down, left, right, pickup, and transform. The agent cannot move onto a Block cell. Pickup removes the object under the agent, and transform changes the object under the agent to Ice. The subtask graph was randomly generated without any hand-coded template (see Section J for details).

Appendix J Details of Subtask Graph Generation

J.1 Mining Domain

Figure 10: The entire graph of Mining domain. Based on this graph, we generated 640 subtask graphs by removing the subtask node that has no parent node.

The precondition of each subtask in the Mining domain was defined as shown in Figure 10. Based on this graph, we generated all possible sub-graphs of it by removing subtask nodes that have no parent node, while always keeping subtasks A, B, D, E, F, G, H, I, K, L. The reward of each subtask was randomly scaled by a factor of .

J.2 Playground Domain

Nodes: number of tasks in each layer
Nodes: number of distractors in each layer
Nodes: number of AND nodes in each layer
Nodes: reward of subtasks in each layer
Edges: number of children of AND node in each layer
Edges: number of children of AND node with NOT connection in each layer
Edges: number of parents with NOT connection of distractors in each layer
Edges: number of children of OR node in each layer
Episode: number of steps given for each episode
Table 3: Parameters for generating tasks, including subtask graph parameters and episode length.

For training and test sample generation, the subtask graph structure was defined in terms of the parameters in Table 3. To cover a wide range of subtask graphs, we randomly sampled the parameters , and from the ranges specified in Tables 4 and 6, while and were manually set. We prevented the graph from including duplicated AND nodes with the same child node(s). We carefully set the range of each parameter such that at least 500 different subtask graphs can be generated with the given parameter ranges. Table 4 summarizes the parameters used to generate training and evaluation subtask graphs for the Playground domain.

Train {1,1,1}-{3,3,3}
(=D1) {0,0,0}-{2,2,1}
D2 {1,1,1}-{3,3,3}
D3 {1,1,1,1}-{3,3,3,3}
D4 {1,1,1,1,1}-{3,3,3,3,3}
Table 4: Subtask graph parameters for training set and tasks D1 to D4.
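As an illustration of how a per-layer parameter might be drawn from such a range, the sketch below reads an entry like {1,1,1}-{3,3,3} as element-wise inclusive lower and upper bounds and samples one integer per layer. This reading of the range notation, the uniform sampling, and the function name `sample_per_layer` are assumptions, not details confirmed by the text.

```python
import random


def sample_per_layer(low, high, rng):
    """Sample one integer per layer, uniformly within [low[i], high[i]]."""
    return [rng.randint(lo, hi) for lo, hi in zip(low, high)]


rng = random.Random(0)  # fixed seed so the example is reproducible
sampled = sample_per_layer([1, 1, 1], [3, 3, 3], rng)
```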

Appendix K Ablation Study on Neural Subtask Graph Solver Agent

K.1 Learning without Pre-training

Zero-Shot Performance (Playground: D1 to D4; Mining: Eval)
Method D1 D2 D3 D4 Eval
NSGS (Ours) .820 .785 .715 .527 8.19
NSGS-task (Ours) .773 .730 .645 .387 6.51
GRProp (Ours) .721 .682 .623 .424 6.16
NSGS-scratch (Ours) .046 .056 .062 .106 3.68
Random 0 0 0 0 2.79
Table 5: Zero-shot generalization performance on Playground and Mining domain. NSGS-scratch agent performs much worse than NSGS and GRProp agent on Playground and Mining domain.

We implemented NSGS-scratch agent that is trained with actor-critic method from scratch without pre-training from GRProp policy to show that pre-training plays a crucial role for training our NSGS agent. Table 5 summarizes the result. NSGS-scratch performs much worse than NSGS, suggesting that pre-training is important in training NSGS. This is not surprising as our problem is combinatorially intractable (e.g. searching over optimal sequence of subtasks given an unseen subtask graph).

K.2 Ablation Study on the Balance between Task and Observation Module

We implemented NSGS-task agent that uses only the task module without observation module to compare the contribution of task module and observation module of NSGS agent. Overall, our NSGS agent outperforms the NSGS-task agent, showing that the observation module improves the performance by a large margin.

Appendix L Experiment Result on Subtask Graph Features

Figure 11: Normalized performance on subtask graphs with different types of dependencies.

To investigate how agents deal with different types of subtask graph components, we evaluated all agents on the following types of subtask graphs:

  • ‘Base’ set consists of subtask graphs with AND and OR operations, but without NOT operation.

  • ‘Base-OR’ set removes all the OR operations from the base set.

  • ‘Base+Distractor’ set adds several distractor subtasks to the base set.

  • ‘Base+NOT’ set adds several NOT operations to the base set.

  • ‘Base+NegDistractor’ set adds several negative distractor subtasks to the base set.

  • ‘Base+Delayed’ set assigns zero reward to all subtasks but the top-layer subtask.

Note that we further divided the set of Distractor into Distractor and NegDistractor. The distractor subtask is a subtask without any parent node in the subtask graph. Executing this kind of subtask may give an immediate reward but is sub-optimal in the long run. The negative-distractor subtask is a subtask that is connected to its parent nodes only through NOT connections, with at least one such connection. Executing this subtask may give an immediate reward, but it makes other subtasks no longer executable. Table 6 summarizes the detailed parameters used for generating subtask graphs. The results are shown in Figure 11. Since the ‘Base’ and ‘Base-OR’ sets do not contain the NOT operation and every subtask gives a positive reward, the greedy baseline performs reasonably well compared to other sets of subtask graphs. It is also shown that the gap between NSGS and GRProp is relatively large in these two sets. This is because computing the optimal ordering between subtasks is more important in these kinds of subtask graphs. Since only NSGS can take into account the cost of each subtask from the observation, it can find a better sequence of subtasks more often.

In ‘Base+Distractor’, ‘Base+NOT’, and ‘Base+NegDistractor’ cases, it is more important for the agent to carefully find and execute subtasks that have a positive effect in the long run while avoiding distractors that are not helpful for executing future subtasks. In these tasks, the greedy baseline tends to execute distractors very often because it cannot consider the long-term effect of each subtask in principle. On the other hand, our GRProp can naturally screen out distractors by getting zero or negative gradient during reward back-propagation. Similarly, GRProp performs well on ‘Base+Delayed’ set because it gets non-zero gradients for all subtasks that are connected to the final rewarding subtask. Since our NSGS was distilled from GRProp, it can handle delayed reward or distractors as well as (or better than) GRProp.

Base {1,1,2}-{3,2,2}
-OR {1,1,1}-{1,1,1}
+Distractor {2,1,0,0}
+NOT {0,0,0}-{3,2,2}
+NegDistractor {2,1,0,0}