1 Introduction
Developing the ability to execute many different tasks depending on given task descriptions, and to generalize over unseen task descriptions, is an important problem for building scalable reinforcement learning (RL) agents. Recently, there have been a few attempts to define and solve different forms of task descriptions such as natural language (Oh et al., 2017; Yu et al., 2017) or formal language (Denil et al., 2017; Andreas et al., 2017). However, most prior work has focused on task descriptions which explicitly specify what the agent should do at a high level, which may not be readily available in real-world applications.
To further motivate the problem, let us consider a scenario in which an agent needs to generalize to a complex novel task by performing a composition of subtasks, where the task description and the dependencies among subtasks may change depending on the situation. For example, a human user could ask a physical household robot to make a meal in an hour. A meal may be served with different combinations of dishes, each of which incurs a different cost (e.g., time) and gives a different reward (e.g., user satisfaction) depending on the user's preferences. In addition, there can be complex dependencies between subtasks. For example, bread should be sliced before being toasted, and an omelette and an egg sandwich cannot both be made if there is only one egg left. Due to such complex dependencies, as well as differing rewards and costs, it is often cumbersome for human users to manually provide the optimal sequence of subtasks (e.g., "fry an egg and toast a bread"). Instead, the agent should learn to act in the environment by figuring out the sequence of subtasks that gives the maximum reward within a time budget, just from the properties and dependencies of the subtasks.
The goal of this paper is to formulate and solve such a problem, which we call subtask graph execution, where the agent should execute the given subtask graph in an optimal way, as illustrated in Figure 1. A subtask graph consists of subtasks, corresponding rewards, and dependencies among subtasks in logical-expression form; it subsumes many existing forms (e.g., sequential instructions (Oh et al., 2017)). This allows us to define many complex tasks in a principled way and train the agent to find the optimal way of executing such tasks. Moreover, we aim to solve the problem without explicit search or simulations, so that our method is more easily applicable to practical real-world scenarios, where real-time performance (i.e., fast decision-making) is required and building a simulation model is extremely challenging.
To solve the problem, we propose a new deep RL architecture, called the neural subtask graph solver (NSGS), which encodes a subtask graph using a recursive-reverse-recursive neural network (R3NN) (Parisotto et al., 2016) to consider the long-term effect of each subtask. However, finding the optimal sequence of subtasks while accounting for the long-term dependencies between subtasks and the context of the observation is computationally intractable, and we found it extremely challenging to learn a good policy from scratch. To address this difficulty, we propose to pre-train the NSGS to approximate a novel non-parametric policy called the graph reward propagation policy. The key idea of the graph reward propagation policy is to construct a differentiable representation of the subtask graph such that taking a gradient over the reward propagates reward information between related subtasks, which is used to find a reasonably good subtask to execute. After pre-training, our NSGS architecture is fine-tuned using an actor-critic method.
The experimental results on 2D visual domains with diverse subtask graphs show that our agent implicitly performs complex reasoning by taking into account long-term subtask dependencies as well as the cost of executing each subtask inferred from the observation, and that it can successfully generalize to unseen and larger subtask graphs. Finally, we show that our method is computationally much more efficient than the Monte-Carlo tree search (MCTS) algorithm, and that the performance of our NSGS agent can be further improved by combining it with MCTS, achieving near-optimal performance.
Our contributions can be summarized as follows: (1) We propose a new and challenging RL problem and domain with a richer and more general form of graph-based task descriptions than recent work on multi-task RL. (2) We propose a deep RL architecture that can execute arbitrary unseen subtask graphs and observations. (3) We demonstrate that our method outperforms a state-of-the-art search-based method (MCTS), which implies that our method can efficiently approximate the solution of an intractable search problem without performing any search. (4) We further show that our method can be used to augment MCTS, significantly improving its performance with far fewer simulations.
2 Related Work
Programmable Agent
The idea of learning to execute a given program using RL was introduced by programmable hierarchies of abstract machines (PHAMs) (Parr and Russell, 1997; Andre and Russell, 2000, 2002). PHAMs specify a partial policy using a set of hierarchical finite state machines, and the agent learns to execute the partial program. A different way of specifying a partial policy was explored in the deep RL framework (Andreas et al., 2017). Other approaches used a program as a form of task description rather than a partial policy in the context of multitask RL (Oh et al., 2017; Denil et al., 2017). Our work also aims to build a programmable agent in that we train the agent to execute a given task. However, most of the prior work assumes that the program specifies what to do, and the agent just needs to learn how to do it. In contrast, our work explores a new form of program, called subtask graph (see Figure 1), which describes properties of subtasks and dependencies between them, and the agent is required to figure out what to do as well as how to do it.
Hierarchical Reinforcement Learning
Many hierarchical RL approaches have been proposed to solve complex decision problems via multiple levels of temporal abstractions (Sutton et al., 1999; Dietterich, 2000; Precup, 2000; Ghavamzadeh and Mahadevan, 2003; Konidaris and Barto, 2007). Our work builds upon the prior work in that a high-level controller focuses on finding the optimal subtask, while a low-level controller focuses on executing the given subtask. In this work, we focus on how to train the high-level controller to generalize to novel, complex dependencies between subtasks.
Classical SearchBased Planning
One of the most closely related problems is the planning problem considered in hierarchical task network (HTN) approaches (Sacerdoti, 1975; Erol, 1996; Erol et al., 1994; Nau et al., 1999; Castillo et al., 2005), in that HTNs also aim to find the optimal way to execute tasks given subtask dependencies. However, they aim to execute a single goal task, while the goal of our problem is to maximize the cumulative reward in an RL context. Thus, the agent in our problem not only needs to consider dependencies among subtasks, but also needs to infer the cost from the observation and deal with the stochasticity of the environment. These additional challenges make it difficult to apply such classical planning methods to our problem.
Motion Planning
Another problem related to our subtask graph execution problem is the motion planning (MP) problem (Asano et al., 1985; Canny, 1985, 1987; Faverjon and Tournassoud, 1987; Keil and Sack, 1985). The MP problem is often mapped onto a graph and reduced to a graph search problem. However, unlike our problem, MP approaches aim to find an optimal path to a goal in the graph while avoiding obstacles, similar to HTN approaches.
3 Problem Definition
3.1 Preliminary: Multi-task Reinforcement Learning and Zero-Shot Generalization
We consider an agent presented with a task drawn from some distribution, as in Andreas et al. (2017) and Da Silva et al. (2012). We model each task as a Markov Decision Process (MDP). Let $G \in \mathcal{G}$ be a task parameter available to the agent, drawn from a distribution $P(G)$, where $G$ defines the task and $\mathcal{G}$ is the set of all possible task parameters. The goal is to maximize the expected reward over the whole distribution of MDPs: $\max_{\theta} \mathbb{E}_{G \sim P(G)}\big[\mathbb{E}_{\pi_{\theta}}\big[\sum_{t} \gamma^{t} r_t \mid G\big]\big]$, where the inner expectation is the expected return of the policy given the task defined by $G$, $\gamma$ is a discount factor, $\pi_{\theta}$ is a multi-task policy that we aim to learn, and $r_t$ is the reward at time step $t$. We consider zero-shot generalization, where only a subset of tasks $\mathcal{G}_{train} \subset \mathcal{G}$ is available to the agent during training, and the agent is required to generalize over a set of unseen tasks $\mathcal{G}_{test} \subset \mathcal{G}$ for evaluation, where $\mathcal{G}_{train} \cap \mathcal{G}_{test} = \emptyset$.

3.2 Subtask Graph Execution Problem
The subtask graph execution problem is a multi-task RL problem with a specific form of task parameter called a subtask graph. Figure 1 illustrates an example subtask graph and environment. The task is to execute the given subtasks in an optimal order so as to maximize the reward within a time budget, where there are complex dependencies between subtasks defined by the subtask graph. We assume that the agent has learned a set of options (Precup, 2000; Stolle and Precup, 2002; Sutton et al., 1999) that perform subtasks by executing one or more primitive actions.
Subtask Graph and Environment
We define the terminology as follows:

- Precondition: A precondition of a subtask is defined as a logical expression of other subtasks in sum-of-products (SoP) form, where multiple AND terms are combined with an OR term (e.g., the precondition of subtask J in Figure 1 is OR(AND(D), AND(G))).

- Eligibility vector: $\mathbf{e}_t \in \{0,1\}^N$, where $e^i_t = 1$ if subtask $i$ is eligible (i.e., the precondition of subtask $i$ is satisfied and it has never been executed by the agent) at time $t$, and $e^i_t = 0$ otherwise.

- Completion vector: $\mathbf{x}_t \in \{0,1\}^N$, where $x^i_t = 1$ if subtask $i$ has been executed by the agent while it was eligible, and $x^i_t = 0$ otherwise.

- Subtask reward vector: $\mathbf{r} \in \mathbb{R}^N$ specifies the reward for executing each subtask.

- Reward: The agent receives $r_t = r^i$ if it executes subtask $i$ while it is eligible, and $r_t = 0$ otherwise.

- Time budget: $step_t$ is the number of remaining time steps until episode termination.

- Observation: $obs_t$ is a visual observation at time $t$, as illustrated in Figure 1.

To summarize, a subtask graph $G$ defines $N$ subtasks with their corresponding rewards and preconditions. The state input at time $t$ consists of the observation, the completion vector, the eligibility vector, and the time budget. The goal is to find a policy which maps the given context of the environment to an option.
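As a concrete illustration of the definitions above, eligibility under a sum-of-products precondition can be checked as follows. This is a minimal sketch: the subtask indices and the encoding of AND terms as (index, negated) pairs are our own illustrative conventions, not the paper's.

```python
def is_eligible(precondition, completion, already_executed):
    """Check the eligibility of one subtask.

    precondition: SoP form as a list of AND terms; each AND term is a list of
        (subtask_index, negated) pairs. E.g. OR(AND(D), AND(G)), with D and G
        at (hypothetical) indices 3 and 6, is [[(3, False)], [(6, False)]].
    completion: list of 0/1 completion flags for all subtasks.
    """
    if already_executed:          # an executed subtask is never eligible again
        return False
    # eligible if any AND term is fully satisfied
    return any(all(bool(completion[i]) != negated for i, negated in term)
               for term in precondition)

# Subtask J from Figure 1: OR(AND(D), AND(G)), assuming D at index 3, G at index 6
precond_J = [[(3, False)], [(6, False)]]
completion = [0] * 10
completion[6] = 1                 # G has been completed, so J becomes eligible
```

Negated pairs model the NOT connections used by the distractor subtasks described in Section 5.1.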
Challenges
Our problem is challenging due to the following aspects:

- Generalization: Only a subset of subtask graphs is available during training, but the agent is required to execute previously unseen and larger subtask graphs at test time.

- Complex reasoning: The agent needs to infer the long-term effect of executing individual subtasks in terms of reward and cost (e.g., time), and to find the optimal sequence of subtasks to execute without any explicit supervision or simulation-based search. We note that it may not be easy even for humans to find the solution without explicit search, due to the exponentially large solution space.

- Stochasticity: The outcome of subtask execution is stochastic in our setting (for example, some objects move randomly). Therefore, the agent needs to consider the expected outcome when deciding which subtask to execute.
4 Method
Our neural subtask graph solver (NSGS) is a neural network consisting of a task module and an observation module, as shown in Figure 2. The task module encodes the precondition of each subtask via a bottom-up process and propagates information about future subtasks and rewards to preceding subtasks (i.e., preconditions) via a top-down process. The observation module learns the correspondence between a subtask and its target object, and the relation between the locations of objects in the observation and the time cost. However, due to the complex-reasoning challenge described in Section 3.2, learning to execute the subtask graph from reward alone is extremely difficult. To facilitate learning, we propose the graph reward propagation policy (GRProp), a non-parametric policy that propagates reward information between related subtasks to model their dependencies. Since GRProp acts as a good initial policy, we train the NSGS to approximate the GRProp policy through policy distillation (Rusu et al., 2015; Parisotto et al., 2015), and fine-tune it with an actor-critic method using generalized advantage estimation (GAE) (Schulman et al., 2015) to maximize the reward. Section 4.1 describes the NSGS architecture, and Section 4.2 describes how to construct the GRProp policy.

4.1 Neural Subtask Graph Solver
Task Module
Given a subtask graph $G$, the remaining number of time steps $step_t$, an eligibility vector $\mathbf{e}_t$, and a completion vector $\mathbf{x}_t$, we compute a context embedding using a recursive-reverse-recursive neural network (R3NN) (Parisotto et al., 2016) as follows:

(1) $\phi^{\uparrow}_i = \mathcal{E}^{\uparrow}\big([\mathbf{x}_t; \mathbf{e}_t], \{\phi^{\uparrow}_j\}_{j \in child(i)}\big)$

(2) $\phi^{\downarrow}_i = \mathcal{E}^{\downarrow}\big(\phi^{\uparrow}_i, \{\phi^{\downarrow}_j\}_{j \in parent(i)}\big)$

where $[\cdot\,;\cdot]$ is a concatenation operator, $\mathcal{E}^{\uparrow}$ and $\mathcal{E}^{\downarrow}$ are the bottom-up and top-down encoding functions, and $\phi^{\uparrow}_i$ and $\phi^{\downarrow}_i$ are the bottom-up and top-down embeddings of the $i$-th node, with separate encoders for AND and OR nodes (see Appendix for details). The connection structure of the subtask graph $G$ is specified by matrices whose $(i,j)$-th entry is $+1$ if the $j$-th OR node and the $i$-th AND node are connected without a NOT operation, $-1$ if they are connected through a NOT operation, and $0$ if they are not connected; $child(i)$ and $parent(i)$ denote the sets of the $i$-th node's children and parents, respectively. The top-down embeddings of the subtask (OR) nodes are transformed into reward scores via $s^{reward}_i = \mathbf{w}^{\top}\phi^{\downarrow}_i$, where $\mathbf{w} \in \mathbb{R}^{d}$ is a weight vector for reward scoring and $d$ is the dimension of the top-down embedding of an OR node.
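The two passes in Eqs. (1)-(2) amount to message passing over the graph: bottom-up from children, then top-down from parents. The sketch below shows only this information flow; the mean-plus-tanh combiners are placeholders for the learned encoding functions, which in the actual model are separate networks for AND and OR nodes.

```python
import numpy as np

def r3nn_embed(children, parents, node_input):
    """Bottom-up then top-down embedding of a DAG (placeholder combiners).

    children[i], parents[i]: index lists; node_input[i]: input vector of node i.
    Returns per-node bottom-up and top-down embeddings.
    """
    n, dim = len(node_input), len(node_input[0])
    up, down = [None] * n, [None] * n

    def bottom_up(i):
        if up[i] is None:
            msgs = [bottom_up(c) for c in children[i]]
            agg = np.mean(msgs, axis=0) if msgs else np.zeros(dim)
            up[i] = np.tanh(node_input[i] + agg)   # stands in for the E^up encoder
        return up[i]

    def top_down(i):
        if down[i] is None:
            msgs = [top_down(p) for p in parents[i]]
            agg = np.mean(msgs, axis=0) if msgs else np.zeros(dim)
            down[i] = np.tanh(up[i] + agg)         # stands in for the E^down encoder
        return down[i]

    for i in range(n):
        bottom_up(i)
    for i in range(n):
        top_down(i)
    return up, down
```

In NSGS, the top-down embedding of each subtask node is additionally mapped to a scalar reward score by a linear layer.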
Observation Module
The observation module encodes the input observation $obs_t$ using a convolutional neural network (CNN) and outputs a cost score:

(3) $\mathbf{s}^{cost} = \mathrm{CNN}(obs_t, step_t)$

where $step_t$ is the number of remaining time steps. An ideal observation module would assign a high score to a subtask whose target object is close to the agent, because executing it would require less cost (i.e., time). Likewise, if the expected number of steps required to execute a subtask is larger than the number of remaining steps, an ideal module would assign a low score. The NSGS policy is a softmax policy:

(4) $\pi_{\theta}(o_t \mid G, obs_t, \mathbf{x}_t, \mathbf{e}_t, step_t) = \mathrm{Softmax}\big(\mathbf{s}^{reward} + \mathbf{s}^{cost}\big)$

which adds the reward scores and the cost scores.
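Concretely, Eq. (4) is just a softmax over the per-subtask sum of the two score vectors. A minimal sketch (the score values below are made up for illustration):

```python
import numpy as np

def nsgs_policy(reward_scores, cost_scores):
    # Eq. (4): softmax over the element-wise sum of reward and cost scores
    logits = np.asarray(reward_scores) + np.asarray(cost_scores)
    z = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return z / z.sum()

# three subtasks: the second has a high reward score but its target is
# far away, so the observation module assigns it a very low cost score
probs = nsgs_policy([0.2, 1.5, 0.4], [0.0, -2.0, 0.3])
```

Note how the cost score can override a large reward score, which is exactly the behavior observed qualitatively in Section 5.4.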
4.2 Graph Reward Propagation Policy: Pre-training the Neural Subtask Graph Solver

Intuitively, the graph reward propagation policy is designed to put high probabilities on subtasks that are likely to maximize the modified and smoothed return at time $t$, which will be defined in Eq. 9. Let $\mathbf{x}_t$ be a completion vector and $\mathbf{r}$ be a subtask reward vector (see Section 3 for definitions). Then, the sum of rewards up to time step $t$ is given by:

(5) $U_t = \mathbf{r}^{\top}\mathbf{x}_t$

We first modify the reward formulation such that it gives half of the subtask reward for satisfying the preconditions and the rest for executing the subtask, to encourage the agent to satisfy the preconditions of subtasks with large rewards:

(6) $\tilde{U}_t = \tfrac{1}{2}\mathbf{r}^{\top}\big(\mathbf{x}_t + \mathbf{e}_t\big)$

Let $y^j$ be the output of the $j$-th AND node. The eligibility vector can be computed from the subtask graph $G$ and $\mathbf{x}_t$ as follows:

(7) $e^i_t = \mathrm{OR}_{j \in child(i)}\big(y^j\big), \qquad y^j = \mathrm{AND}_{k \in child(j)}\big(\hat{x}^{j,k}_t\big)$

where $\hat{x}^{j,k}_t = 1 - x^k_t$ if there is a NOT connection between the $j$-th node and the $k$-th node, and $\hat{x}^{j,k}_t = x^k_t$ otherwise. Intuitively, $\hat{x}^{j,k}_t = 1$ when the $k$-th node does not violate the precondition of the $j$-th node. Note that $\mathbf{e}_t$ is not differentiable with respect to $\mathbf{x}_t$ because AND and OR are not differentiable. To derive our graph reward propagation policy, we propose to substitute the AND and OR functions with "smoothed" functions $\widetilde{\mathrm{AND}}$ and $\widetilde{\mathrm{OR}}$ as follows:

(8) $\tilde{e}^i_t = \widetilde{\mathrm{OR}}_{j \in child(i)}\big(\tilde{y}^j\big), \qquad \tilde{y}^j = \widetilde{\mathrm{AND}}_{k \in child(j)}\big(\hat{x}^{j,k}_t\big)$

where $\widetilde{\mathrm{OR}}$ and $\widetilde{\mathrm{AND}}$ are implemented as scaled sigmoid and tanh functions, as illustrated in Figure 3 (see Appendix for details). With the smoothed operations, the sum of the smoothed and modified rewards is given by:

(9) $\tilde{U}_t = \tfrac{1}{2}\mathbf{r}^{\top}\big(\mathbf{x}_t + \tilde{\mathbf{e}}(\mathbf{x}_t)\big)$

Finally, the graph reward propagation policy is a softmax policy,

(10) $\pi_{GRProp}(o_t \mid \mathbf{x}_t, G) = \mathrm{Softmax}\big(\nabla_{\mathbf{x}_t}\tilde{U}_t\big)$

that is, the softmax of the gradient of $\tilde{U}_t$ with respect to $\mathbf{x}_t$.
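The construction above can be illustrated on a two-subtask graph where B's precondition is AND(A) and only B gives reward. The particular smoothing below (a shifted sigmoid for AND) and all constants are illustrative assumptions, not the paper's exact scaled sigmoid/tanh forms, and the gradient of the smoothed return in Eq. (9) is taken numerically here rather than by backpropagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_and(inputs, w=4.0):
    # ~1 only when every input is ~1 (a smoothed AND; illustrative form)
    return sigmoid(w * (np.sum(inputs) - len(inputs) + 0.5))

def smoothed_return(x, r):
    # x: smoothed completion vector [x_A, x_B]; r: subtask reward vector
    e_B = soft_and([x[0]])              # B becomes eligible once A is completed
    e = np.array([1.0, e_B])            # A has no precondition
    # Eq. (6)/(9): half reward for satisfying preconditions, half for execution
    return float(r @ (0.5 * (x + e)))

def grprop_logits(x, r, eps=1e-4):
    # numerical gradient of the smoothed return w.r.t. the completion vector
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (smoothed_return(x + d, r) - smoothed_return(x - d, r)) / (2 * eps)
    return g

r = np.array([0.0, 1.0])                # delayed reward: only B is rewarded
x = np.array([0.0, 0.0])                # nothing completed yet
logits = grprop_logits(x, r)            # logits[0] > 0: B's reward flows to A
```

Even though subtask A itself gives zero reward, its logit is positive because completing A increases B's smoothed eligibility; this is the sense in which the gradient "propagates" reward through the graph.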
4.3 Policy Optimization
The NSGS is first trained through policy distillation, minimizing the KL divergence between the NSGS policy and the teacher (GRProp) policy:

(11) $\min_{\theta}\; \mathbb{E}_{G \sim \mathcal{G}_{train}}\big[ D_{KL}\big(\pi_{GRProp}(\cdot \mid G) \,\|\, \pi_{\theta}(\cdot \mid G)\big) \big]$

where $\theta$ is the parameter of NSGS, $\pi_{\theta}(\cdot \mid G)$ is shorthand for the NSGS policy given subtask graph $G$, $\pi_{GRProp}(\cdot \mid G)$ is shorthand for the teacher (GRProp) policy given $G$, $D_{KL}$ is the KL divergence, and $\mathcal{G}_{train}$ is the training set of subtask graphs. After policy distillation, we fine-tune the NSGS agent in an end-to-end manner using an actor-critic method with GAE (Schulman et al., 2015):

(12) $\nabla_{\theta} J = \mathbb{E}\big[\nabla_{\theta}\log\pi_{\theta}(o_t \mid \cdot)\, A^{GAE}_t\big]$

(13) $A^{GAE}_t = \sum_{l=0}^{\infty}(\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma^{d_t}\,V_{\phi}(s_{t+1}) - V_{\phi}(s_t)$

where $d_t$ is the duration of the executed option, $\gamma$ is a discount factor, $\lambda$ is a weight for balancing between the bias and the variance of the advantage estimation, and $V_{\phi}$ is the critic network parameterized by $\phi$. During training, we update the critic network to minimize $\big(R_t - V_{\phi}(s_t)\big)^2$, where $R_t$ is the discounted cumulative reward at time $t$. The complete procedure for training our NSGS agent is summarized in Algorithm 1. We used values of 1e4 and 3e6 for distillation and 1e6 and 3e7 for fine-tuning in the experiments.

5 Experiment
In the experiments, we investigated the following research questions: 1) Does GRProp outperform other heuristic baselines (e.g., a greedy policy)? 2) Can NSGS deal with complex subtask dependencies, delayed reward, and the stochasticity of the environment? 3) Can NSGS generalize to unseen subtask graphs? 4) How does NSGS perform compared to MCTS? 5) Can NSGS be used to improve MCTS?
5.1 Environment
We evaluated the performance of our agents on two domains, Mining and Playground, developed based on MazeBase (Sukhbaatar et al., 2015). The code is available at https://github.com/srsohn/subtaskgraphexecution. We used a pre-trained subtask executor for each domain. The episode length (time budget) was randomly set for each episode, in a range such that the GRProp agent can execute only a fraction of the subtasks on average. Subtasks in the higher layers of a subtask graph are designed to give larger rewards (see Appendix for details).
The Mining domain is inspired by Minecraft (see Figures 1 and 5). The agent may pick up raw materials in the world and use them to craft different items at different crafting stations. There are two forms of preconditions: 1) an item may be an ingredient for crafting other items (e.g., stick and stone are ingredients of the stone pickaxe), and 2) certain tools are required to pick up certain objects (e.g., the agent needs a stone pickaxe to mine iron ore). The agent can use an item multiple times after picking it up once. The set of subtasks and preconditions is hand-coded based on the crafting recipes in Minecraft and used as a template to generate 640 random subtask graphs. We used 200 for training and 440 for testing.
Playground is a more flexible and challenging domain (see Figure 6). The subtask graphs in Playground are randomly generated, so a precondition can be any logical expression, and the reward may be delayed. Some of the objects move randomly, which makes the environment stochastic. The agent was trained on small subtask graphs and evaluated on much larger ones (see Table 1). The set of subtasks is the product of a set of primitive actions for interacting with objects and the set of all types of interactive objects in the domain. We randomly generated 500 graphs for training and 2,000 graphs for testing. Note that the task in the Playground domain subsumes many other hierarchical RL domains, such as Taxi (Bloch, 2009), Minecraft (Oh et al., 2017), and XWORLD (Yu et al., 2017). In addition, we added the following components to the subtask graphs to make the task more challenging:

- Distractor subtask: A subtask with only NOT connections to its parent nodes in the subtask graph. Executing it may give an immediate reward, but it may make other subtasks ineligible.

- Delayed reward: The agent receives no reward from subtasks in the lower layers, but it should execute some of them to make higher-level subtasks eligible (see Appendix for the fully-delayed reward case).
5.2 Agents
Table 1: Subtask graph settings and agent performance. Playground results are normalized rewards; Mining results are unnormalized mean rewards.

Subtask Graph Setting
                   Playground                    Mining
Task               D1     D2     D3     D4      Eval
Depth              4      4      5      6       4-10
#Subtasks          13     15     16     16      10-26

Zero-Shot Performance
Task               D1     D2     D3     D4      Eval
NSGS (Ours)        .820   .785   .715   .527    8.19
GRProp (Ours)      .721   .682   .623   .424    6.16
Greedy             .164   .144   .178   .228    3.39
Random             0      0      0      0       2.79

Adaptation Performance
Task               D1     D2     D3     D4      Eval
NSGS (Ours)        .828   .797   .733   .552    8.58
Independent        .346   .296   .193   .188    3.89
We evaluated the following policies:

- Random policy executes any eligible subtask.

- Greedy policy executes the eligible subtask with the largest reward.

- Optimal policy is computed from an exhaustive search over eligible subtasks.

- GRProp (Ours) is the graph reward propagation policy.

- NSGS (Ours) is distilled from the GRProp policy and fine-tuned with actor-critic.

- Independent is an LSTM-based baseline trained on each subtask graph independently, similar to the Independent model in Andreas et al. (2017). It takes the same set of inputs as NSGS except for the subtask graph.
To the best of our knowledge, existing work on hierarchical RL cannot directly address our problem with a subtask graph input. Instead, we evaluated an instance of a hierarchical RL method (the Independent agent) in the adaptation setting, as discussed in Section 5.3.
5.3 Quantitative Result
Training Performance
The learning curves of NSGS and the performance of the other agents are shown in Figure 4. Our GRProp policy significantly outperforms the Greedy policy, which implies that the proposed idea of backpropagating the reward gradient captures long-term dependencies among subtasks to some extent. We also found that NSGS further improves performance through fine-tuning with the actor-critic method. We hypothesize that NSGS learned to estimate the expected cost of executing subtasks from the observations and to consider it along with the subtask graphs.
Generalization Performance
We considered two types of generalization: a zero-shot setting, where the agent must immediately achieve good performance on unseen subtask graphs without learning, and an adaptation setting, where the agent can learn about the task through interaction with the environment. Note that the Independent agent was evaluated only in the adaptation setting, since it does not take the subtask graph as input and thus has no ability to generalize. In particular, we tested the agents on larger subtask graphs by varying the number of layers of the subtask graphs from four to six, with a larger number of subtasks, in the Playground domain. Table 1 summarizes the results in terms of the normalized reward, where 0 and 1 correspond to the average reward of the Random and the Optimal policy, respectively. Due to the large number of subtasks in the Mining domain, the Optimal policy was intractable to evaluate; instead, we report the unnormalized mean reward. Though performance degrades as the subtask graph becomes larger, as expected, NSGS generalizes well to larger subtask graphs and consistently outperforms all the other agents on the Playground and Mining domains in the zero-shot setting. In the adaptation setting, NSGS performs slightly better than in the zero-shot setting by fine-tuning on the subtask graphs in the evaluation set. The Independent agent learned a policy comparable to Greedy, but performs much worse than NSGS.
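For reference, the normalization used in Table 1 is a simple affine rescaling; a sketch (the variable names are ours):

```python
def normalized_reward(r_agent, r_random, r_optimal):
    # maps an average episode reward to [0, 1]:
    # 0 = the Random policy's average reward, 1 = the Optimal policy's
    return (r_agent - r_random) / (r_optimal - r_random)
```

A score above 1 or below 0 is possible in principle, but only if an agent beats the Optimal policy or loses to Random, respectively.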
5.4 Qualitative Result
Figure 5 visualizes trajectories of agents in the Mining domain. The Greedy policy mostly focuses on subtasks with immediate rewards (e.g., get string, make bow) that are suboptimal in the long run. In contrast, the NSGS and GRProp agents focus on executing subtask H (make stone pickaxe) in order to collect materials much faster in the long run. Compared to GRProp, NSGS also takes the observation into account and avoids subtasks with high cost (e.g., get coal).
Figure 6 visualizes trajectories in the Playground domain. In this graph, there are distractors (e.g., D, E, and H) and the reward is delayed. In the beginning, Greedy chooses to execute the distractors, since they give positive reward while subtasks A, B, and C do not. However, GRProp observes non-zero gradients for subtasks A, B, and C that are propagated from their parent nodes. Thus, even though the reward is delayed, GRProp can figure out which subtasks to execute. NSGS learns these long-term dependencies from GRProp and finds a shorter path by also considering the observation.
5.5 Combining NSGS with MonteCarlo Tree Search
We further investigated how well our NSGS agent performs compared to conventional search-based methods, and how it can be combined with them to further improve performance. We implemented the following methods (see Appendix for details):

- MCTS: An MCTS algorithm with the UCB criterion (Auer et al., 2002) for choosing actions.

- MCTS+NSGS: An MCTS algorithm combined with our NSGS agent. The NSGS policy was used as a rollout policy to explore reasonably good states during the tree search, similar to AlphaGo (Silver et al., 2016).

- MCTS+GRProp: An MCTS algorithm combined with our GRProp agent, analogous to MCTS+NSGS.
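The UCB criterion used to select actions during the tree search can be sketched as follows. This is the generic UCB1 form; the exploration constant and the layout of the per-child statistics are our own assumptions:

```python
import math

def ucb_select(children, c=1.4):
    """children: list of (action, visit_count, total_value) for one tree node."""
    total_visits = sum(n for _, n, _ in children)

    def ucb(child):
        _, n, w = child
        if n == 0:
            return float('inf')            # always expand unvisited actions first
        # mean value plus an exploration bonus that shrinks with visit count
        return w / n + c * math.sqrt(math.log(total_visits) / n)

    return max(children, key=ucb)[0]
```

In MCTS+NSGS, leaf evaluation additionally replaces the random rollout with the NSGS policy, which is what concentrates the search on promising subtask sequences.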
The results are shown in Figure 7. Our NSGS performs as well as MCTS with approximately 32K simulations on Playground and 11K simulations on Mining, while GRProp performs as well as MCTS with approximately 11K simulations on Playground and 1K simulations on Mining. This indicates that our NSGS agent implicitly performs long-term reasoning that is not easily achievable even by a sophisticated MCTS, despite using no simulations and never having seen such subtask graphs during training. More interestingly, MCTS+NSGS and MCTS+GRProp significantly outperform plain MCTS, and MCTS+NSGS achieves nearly optimal normalized reward with 33K simulations on Playground. We found that the Optimal policy, which corresponds to a normalized reward of 1, uses approximately 648M simulations on Playground. Thus, MCTS+NSGS performs almost as well as the Optimal policy using only a tiny fraction of the simulations. This result implies that NSGS can also be used to improve simulation-based planning methods by effectively reducing the search space.
6 Conclusion
We introduced the subtask graph execution problem, an effective and principled framework for describing complex tasks. To address the difficulty of dealing with complex subtask dependencies, we proposed a graph reward propagation policy derived from a differentiable form of the subtask graph, which plays an important role in pre-training our neural subtask graph solver architecture. The empirical results showed that our agent can deal with long-term dependencies between subtasks and generalize well to unseen subtask graphs. In addition, we showed that our agent can be used to effectively reduce the search space of MCTS, so that the agent can find a near-optimal solution with a small number of simulations. In this paper, we assumed that the subtask graph (e.g., subtask dependencies and rewards) is given to the agent. An interesting direction for future work is to extend our approach to more challenging scenarios where the subtask graph is unknown (or only partially known) and thus needs to be estimated through experience.
Acknowledgments
This work was supported mainly by the ICT R&D program of MSIP/IITP (2016000563: Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion) and partially by DARPA Explainable AI (XAI) program #313498 and Sloan Research Fellowship.
References
 Oh et al. [2017] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zeroshot task generalization with multitask deep reinforcement learning. ICML, 2017.
 Yu et al. [2017] Haonan Yu, Haichao Zhang, and Wei Xu. A deep compositional framework for humanlike language acquisition in virtual environment. arXiv:1703.09831, 2017.
 Denil et al. [2017] Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. Programmable agents. arXiv:1706.06383, 2017.
 Andreas et al. [2017] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. ICML, 2017.
 Parisotto et al. [2016] Emilio Parisotto, Abdelrahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neurosymbolic program synthesis. arXiv:1611.01855, 2016.
 Parr and Russell [1997] Ronald Parr and Stuart J. Russell. Reinforcement learning with hierarchies of machines. NIPS, 1997.
 Andre and Russell [2000] David Andre and Stuart J. Russell. Programmable reinforcement learning agents. NIPS, 2000.
 Andre and Russell [2002] David Andre and Stuart J. Russell. State abstraction for programmable reinforcement learning agents. AAAI/IAAI, 2002.
 Sutton et al. [1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semimdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 1999.
 Dietterich [2000] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. JAIR, 2000.
 Precup [2000] Doina Precup. Temporal abstraction in reinforcement learning. PhD thesis, 2000.
 Ghavamzadeh and Mahadevan [2003] Mohammad Ghavamzadeh and Sridhar Mahadevan. Hierarchical policy gradient algorithms. ICML, 2003.
 Konidaris and Barto [2007] George Konidaris and Andrew G. Barto. Building portable options: Skill transfer in reinforcement learning. IJCAI, 2007.
 Sacerdoti [1975] Earl D Sacerdoti. The nonlinear nature of plans. Technical report, Stanford Research Institute, Menlo Park, CA, 1975.
 Erol [1996] Kutluhan Erol. Hierarchical task network planning: formalization, analysis, and implementation. PhD thesis, 1996.
 Erol et al. [1994] Kutluhan Erol, James A Hendler, and Dana S Nau. Umcp: A sound and complete procedure for hierarchical tasknetwork planning. AIPS, 1994.
 Nau et al. [1999] Dana Nau, Yue Cao, Amnon Lotem, and Hector MunozAvila. Shop: Simple hierarchical ordered planner. IJCAI, 1999.
 Castillo et al. [2005] Luis Castillo, Juan FdezOlivares, Óscar GarcíaPérez, and Francisco Palao. Temporal enhancements of an htn planner. CAEPIA, 2005.
 Asano et al. [1985] Takao Asano, Tetsuo Asano, Leonidas Guibas, John Hershberger, and Hiroshi Imai. Visibilitypolygon search and euclidean shortest paths. FOCS, 1985.
 Canny [1985] John Canny. A voronoi method for the pianomovers problem. ICRA, 1985.
 Canny [1987] John Canny. A new algebraic method for robot motion planning and real geometry. FOCS, 1987.

 Faverjon and Tournassoud [1987] Bernard Faverjon and Pierre Tournassoud. A local based approach for path planning of manipulators with a high number of degrees of freedom. ICRA, 1987.
 Keil and Sack [1985] J Mark Keil and Jorg-R Sack. Minimum decompositions of polygonal objects. Machine Intelligence and Pattern Recognition, 1985.
 Da Silva et al. [2012] Bruno Da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. arXiv:1206.6398, 2012.
 Stolle and Precup [2002] Martin Stolle and Doina Precup. Learning options in reinforcement learning. ISARA, 2002.
 Rusu et al. [2015] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv:1511.06295, 2015.
 Parisotto et al. [2015] Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. Actormimic: Deep multitask and transfer reinforcement learning. ArXiv, 2015.
 Schulman et al. [2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. arXiv:1506.02438, 2015.
 Sukhbaatar et al. [2015] Sainbayar Sukhbaatar, Arthur Szlam, Gabriel Synnaeve, Soumith Chintala, and Rob Fergus. Mazebase: A sandbox for learning from games. arXiv:1511.07401, 2015.
 Bloch [2009] Mitchell Keith Bloch. Hierarchical reinforcement learning in the taxicab domain. Technical report, Center for Cognitive Architecture, University of Michigan, 2009.
 Auer et al. [2002] Peter Auer, Nicolo CesaBianchi, and Paul Fischer. Finitetime analysis of the multiarmed bandit problem. Machine learning, 2002.
 Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
 Hayes and Scassellati [2016] Bradley Hayes and Brian Scassellati. Autonomously constructing hierarchical task networks for planning and human-robot collaboration. ICRA, 2016.
 Ghazanfari and Taylor [2017] Behzad Ghazanfari and Matthew E Taylor. Autonomous extracting a hierarchical structure of tasks in reinforcement learning and multitask reinforcement learning. arXiv:1709.04579, 2017.
 Huang et al. [2018] De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, and Juan Carlos Niebles. Neural task graphs: Generalizing to unseen tasks from a single video demonstration. arXiv:1807.03480, 2018.
Appendix A Details of the Task
We define each task as an MDP tuple (S, A, P, R, ρ), where S is the set of states, A is the set of actions, P is a task-specific state transition function, R is a task-specific reward function, and ρ is a task-specific initial distribution over states. We describe the subtask graph and each component of the MDP in the following paragraphs.
Subtask and Subtask Graph
The subtask graph consists of the subtasks (a subset of the full subtask set), the subtask rewards, and the precondition of each subtask. The full set of subtasks is the Cartesian product of the set of primitive actions used to interact with objects and the set of all types of interactive objects in the domain. To execute a subtask, the agent should move onto the target object and take the corresponding primitive action.
State
The state consists of the observation, the completion vector, the time budget, and the eligibility vector. An observation is represented as an H x W x C tensor, where H and W are the height and width of the map respectively, and C is the number of object types in the domain. The (h, w, c)-th element of the observation tensor is 1 if there is an object of type c at position (h, w) on the map, and 0 otherwise. The time budget indicates the number of remaining time steps until episode termination. The completion vector and the eligibility vector provide additional information about the subtasks; their details are explained in the following paragraphs.
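As an illustration, the one-hot observation encoding described above can be sketched as follows; the array layout and helper names are hypothetical, not taken from the paper:

```python
import numpy as np

def build_observation(object_grid, num_object_types):
    """One-hot observation tensor: obs[h, w, c] = 1 iff an object of type c
    occupies cell (h, w); empty cells are all-zeros.
    `object_grid` holds an object-type index per cell, or -1 for empty."""
    H, W = object_grid.shape
    obs = np.zeros((H, W, num_object_types), dtype=np.float32)
    for h in range(H):
        for w in range(W):
            c = object_grid[h, w]
            if c >= 0:
                obs[h, w, c] = 1.0
    return obs

# toy 2x2 map: an object of type 0 at (0,1), type 2 at (1,0)
grid = np.array([[-1, 0],
                 [ 2, -1]])
obs = build_observation(grid, num_object_types=3)
```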
State Distribution and Transition Function
Given the current state, the next state is computed from the subtask graph. At the beginning of an episode, the initial time budget is sampled from a pre-specified range for each subtask graph (see Section J for details), the completion vector is initialized to a zero vector, and the observation is sampled from the task-specific initial state distribution. Specifically, the observation is generated by randomly placing the agent and the objects corresponding to the subtasks defined in the subtask graph. When the agent executes subtask i, the j-th element of the completion vector is updated by the following rule:

x_{t+1}^{j} = 1 if j = i and e_t^{i} = 1, and x_{t+1}^{j} = x_t^{j} otherwise.  (14)
The observation is updated such that the agent moves onto the target object and performs the corresponding primitive action (see Section I for the full list of subtasks and corresponding primitive actions in the Mining and Playground domains). The eligibility vector e_t is computed from the completion vector x_t and the subtask graph as follows:
\hat{x}_t^{ij} = w_{ij} x_t^{i} + (1 - w_{ij})(1 - x_t^{i})  (15)

y_t^{j} = \bigwedge_{i \in Child(j)} \hat{x}_t^{ij}  (16)

e_t^{k} = \bigvee_{j \in Child(k)} y_t^{j}  (17)
where w_{ij} = 0 if there is a NOT connection between the i-th node and the j-th node, and w_{ij} = 1 otherwise. Intuitively, \hat{x}_t^{ij} = 1 when the i-th node does not violate the precondition of the j-th node. Executing each subtask costs a different amount of time depending on the map configuration. Specifically, the time cost is the Manhattan distance between the agent location and the target object location in the grid world, plus one more step for performing the primitive action.
Taskspecific Reward Function
The reward function is defined in terms of the subtask reward vector r and the eligibility vector e_t, where the subtask reward vector is a component of the subtask graph and the eligibility vector is computed from the completion vector and the subtask graph as in Eq. 17. Specifically, when the agent executes subtask i, the reward given to the agent at time step t is:

r_t = e_t^{i} \, r^{i}.  (18)
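A minimal sketch of the eligibility and reward computations above, assuming a hypothetical clause-list encoding of the precondition graph (each subtask's precondition is a list of AND-clauses of (child index, w) literals; these names and the encoding are illustrative, not the paper's data structures):

```python
import numpy as np

def eligibility(x, preconds):
    """e[i] = OR over AND-clauses; a clause holds when every literal (j, w)
    is satisfied: x_hat = w*x_j + (1-w)*(1-x_j) = 1 (w = 0 marks a NOT
    connection). An empty clause list means always eligible."""
    e = np.zeros(len(preconds), dtype=np.float32)
    for i, clauses in enumerate(preconds):
        if not clauses:
            e[i] = 1.0
            continue
        for clause in clauses:
            if all(w * x[j] + (1 - w) * (1 - x[j]) == 1 for j, w in clause):
                e[i] = 1.0
                break
    return e

def execute(x, e, rewards, i):
    """Completion update and reward: complete subtask i if eligible,
    and receive reward e_i * r_i."""
    r = e[i] * rewards[i]
    x = x.copy()
    if e[i] == 1:
        x[i] = 1
    return x, r

# toy graph: subtask 0 has no precondition; subtask 1 needs 0 done;
# subtask 2 needs 0 done AND 1 NOT done
preconds = [[], [[(0, 1)]], [[(0, 1), (1, 0)]]]
rewards = np.array([0.1, 0.5, 1.0])
x = np.zeros(3)
e = eligibility(x, preconds)        # only subtask 0 is eligible at first
x, r0 = execute(x, e, rewards, 0)
e = eligibility(x, preconds)        # now subtasks 1 and 2 become eligible
```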
Appendix B Experiment on Hierarchical Task Network
We compared our method with recent graph-based multitask RL works [Hayes and Scassellati, 2016, Ghazanfari and Taylor, 2017, Huang et al., 2018]. However, these methods cannot be directly applied to our problem for two main reasons: 1) they aim to solve single-goal tasks, which form only a subset of our problem, and 2) they require search or learning during test time, which means they cannot be applied in the zero-shot generalization setting. Specifically, each trajectory in a single-goal task is assumed to be labeled as success or failure depending on whether the goal was achieved, which is necessary for these methods [Hayes and Scassellati, 2016, Ghazanfari and Taylor, 2017, Huang et al., 2018] to infer the task structure (e.g., a hierarchical task network (HTN) [Sacerdoti, 1975]). Since our task setting is more general and not limited to single-goal tasks, task structures with multiple goals cannot be inferred with these methods.
For a direct comparison, we simplified our problem into a single-goal task as follows: 1) we set a single goal by setting all subtask rewards to 0 except for the top-level subtask, which we set as the terminal state, and 2) we removed the cost, the time budget, and the observation. After constructing a task network such as an HTN, these methods execute the task by planning [Hayes and Scassellati, 2016] or by learning a policy [Ghazanfari and Taylor, 2017, Huang et al., 2018] during the test stage. Accordingly, we evaluated the HTN-Plan method [Hayes and Scassellati, 2016] in the planning setting, and allowed learning at test time for Ghazanfari and Taylor [2017] and Huang et al. [2018]. Note that these methods cannot execute a task in the zero-shot setting, while our NSGS can by learning an embedding of the subtask graph; this is the main reason why our method performs much better than these methods in the following two experiments.
Method          | Adaptation (HTN)
NSGS (Ours)     | .90
HTN-Independent | .31
B.1 Comparison with HTN-Planning
Hayes and Scassellati [2016] performed planning on the inferred task network to find the optimal solution. Thus, we implemented HTN-Plan with MCTS as in Section 5.5 and compared it with ours in the planning setting, evaluating our MCTS+NSGS and MCTS+GRProp for comparison. The figure shows that our MCTS+NSGS and MCTS+GRProp agents outperform HTN-Plan by a large margin.
B.2 Comparison with HTN-based Agent
Instead of planning, Ghazanfari and Taylor [2017] learned a hierarchical RL (HRL) agent on the constructed HTN during testing. Thus, we evaluated it in the adaptation setting (i.e., learning during test time). To this end, we implemented an HRL agent, HTN-Independent, which is a policy over options trained on each subtask graph independently, similar to the Independent agent (see Section 5.2). The result shows that our NSGS agent can find the solution much faster than the HTN-Independent agent due to its zero-shot generalization ability.
Huang et al. [2018] inferred the subtask graph from a visual demonstration at test time. Since the environment state is available in our setting, providing a demonstration amounts to providing the solution; thus we could not compare with it.
Appendix C Details of NSGS Architecture
Task module
Figure 9 illustrates the structure of the task module of the NSGS architecture for a given input subtask graph. Specifically, the task module was implemented with four encoders, whose inputs and outputs are defined in Section 4.1 of the main text as:
(19)  
(20) 
For the bottom-up process, each encoder takes the output embeddings of its children encoders as input. Similarly, for the top-down process, each encoder takes the output embeddings of its parent encoders as input. The input embeddings are aggregated by element-wise summation; for two of the encoders, the embeddings are first concatenated with a flag indicating a NOT connection before the element-wise summation. The summed embedding is then concatenated with all additional inputs as defined in Eqs. 19 and 20, and further transformed by three fully-connected layers with 128 units each; the last fully-connected layer outputs a 128-dimensional embedding. The embeddings are transformed into reward scores via a linear mapping with a weight vector for reward scoring, whose input dimension is that of the top-down embedding of the OR node. Similarly, the reward baseline is computed by applying the reduced-sum operation followed by a linear mapping with the weight vector for the reward baseline. We used the parametric ReLU (PReLU) as the activation function.
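The aggregation-then-MLP step described above can be sketched as follows; all shapes, initializations, and names are hypothetical stand-ins for the actual learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128  # embedding dimension (as in the paper's 128-unit layers)

def aggregate(child_embeddings, not_flags):
    """Concatenate each child embedding with its NOT-connection flag,
    then aggregate by element-wise summation."""
    tagged = [np.concatenate([emb, [flag]])
              for emb, flag in zip(child_embeddings, not_flags)]
    return np.sum(tagged, axis=0)

def mlp(x, layers):
    """Three fully-connected layers with PReLU activations (slope 0.25)."""
    for W, b in layers:
        x = x @ W + b
        x = np.where(x > 0, x, 0.25 * x)  # PReLU
    return x

children = [rng.standard_normal(D) for _ in range(3)]
layers = [(rng.standard_normal((D + 1, D)) * 0.01, np.zeros(D)),
          (rng.standard_normal((D, D)) * 0.01, np.zeros(D)),
          (rng.standard_normal((D, D)) * 0.01, np.zeros(D))]
h = mlp(aggregate(children, not_flags=[0, 1, 0]), layers)
w_r = rng.standard_normal(D)   # reward-scoring weight vector
reward_score = h @ w_r         # scalar score for this node
```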
Observation module
The network consists of BN1-Conv1(16x1x1-1/0)-BN2-Conv2(32x3x3-1/1)-BN3-Conv3(64x3x3-1/1)-BN4-Conv4(96x3x3-1/1)-BN5-Conv5(128x3x3-1/1)-BN6-Conv6(64x1x1-1/0)-FC(256). The output embedding of FC(256) is then concatenated with the number of remaining time steps. Finally, the network has two fully-connected output layers for the cost score and the cost baseline. The policy of NSGS is then calculated by adding the reward score and the cost score and taking a softmax:
\pi(\cdot \mid s_t, G) = \mathrm{Softmax}(\text{reward score} + \text{cost score})  (21)
The baseline output is obtained by adding reward baseline and cost baseline:
b(s_t, G) = \text{reward baseline} + \text{cost baseline}  (22)
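Eqs. 21 and 22 amount to summing the two modules' per-option outputs; a minimal numerical sketch (the score values are made up):

```python
import numpy as np

def nsgs_policy(reward_scores, cost_scores):
    """Add the task-module reward scores and the observation-module cost
    scores per option, then apply a numerically stable softmax."""
    logits = reward_scores + cost_scores
    z = np.exp(logits - logits.max())
    return z / z.sum()

def nsgs_baseline(reward_baseline, cost_baseline):
    """The value baseline is the sum of the two module baselines."""
    return reward_baseline + cost_baseline

pi = nsgs_policy(np.array([1.0, 0.5, -0.2]),
                 np.array([-0.3, 0.4, 0.1]))
```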
Appendix D Details of Learning NSGS Agent
Learning objectives
The NSGS architecture is first trained through policy distillation and then fine-tuned using an actor-critic method with a generalized advantage estimator. During policy distillation, the KL divergence between the NSGS policy and the teacher policy (GRProp) is minimized as follows:
\mathcal{L}_{\mathrm{distill}}(\theta) = \mathbb{E}_{G \sim \mathcal{G}}\left[ D_{\mathrm{KL}}\left( \pi^{T}_{G} \,\|\, \pi_{\theta, G} \right) \right]  (23)

where \theta is the parameter of the NSGS architecture, \pi_{\theta, G} is the simplified notation of the NSGS policy with subtask graph input G, \pi^{T}_{G} is the simplified notation of the teacher (GRProp) policy with subtask graph input G, and \mathcal{G} is the training set of subtask graphs.
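The distillation objective can be sketched numerically as follows, assuming the KL(teacher || student) direction over a discrete option space (the policy values are made up):

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL(p || q) for discrete policies over options."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def distill_loss(teacher_policies, student_policies):
    """Mean over training graphs of KL(teacher || student)."""
    return np.mean([kl(p, q)
                    for p, q in zip(teacher_policies, student_policies)])

teacher = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
student = [np.array([0.6, 0.3, 0.1]), np.array([0.2, 0.6, 0.2])]
loss = distill_loss(teacher, student)
```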
For both policy distillation and fine-tuning, we sampled one subtask graph for each of 16 parallel workers, and each worker in turn sampled a mini-batch of 16 world configurations (maps). NSGS thus generates 256 episodes in total in parallel. After generating the episodes, the gradients from the 256 episodes are collected, averaged, and backpropagated to update the parameters. For policy distillation, we trained NSGS for 40 epochs, where each epoch involves 100 updates. Since our GRProp policy observes only the subtask graph, we trained only the task module during policy distillation. The observation module was trained on an auxiliary prediction task: predicting the number of steps taken by the agent to execute each subtask.
After policy distillation, we fine-tune the NSGS agent in an end-to-end manner using an actor-critic method with generalized advantage estimation (GAE) [Schulman et al., 2015] as follows:
\nabla_\theta \mathcal{L}(\theta) = -\,\mathbb{E}\left[ \nabla_\theta \log \pi_{\theta}(o_t \mid s_t, G)\, \hat{A}_t \right]  (24)

\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma^{d_t} V_{\nu}(s_{t+1}) - V_{\nu}(s_t)  (25)
where d_t is the duration of option o_t, \gamma is a discount factor, \lambda is a weight for balancing between bias and variance of the advantage estimation, and V_{\nu} is the critic network parameterized by \nu. During training, we update the critic network to minimize (R_t - V_{\nu}(s_t))^2, where R_t is the discounted cumulative return at time t.
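A sketch of the advantage computation over an episode of options; the way option durations enter the backward accumulation and the λ value are assumptions for illustration:

```python
import numpy as np

def gae_advantages(rewards, values, durations, gamma=0.99, lam=0.96):
    """Backward GAE recursion at the option level:
    delta_t = r_t + gamma**d_t * V(s_{t+1}) - V(s_t), accumulated with
    weight gamma**d_t * lam (duration-aware discounting is assumed)."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma ** durations[t] * next_v - values[t]
        gae = delta + (gamma ** durations[t]) * lam * gae
        adv[t] = gae
    return adv

adv = gae_advantages(rewards=[0.5, 1.0], values=[0.2, 0.6],
                     durations=[3, 2])
```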
Hyperparameters
For both fine-tuning and policy distillation, we used the RMSProp optimizer with a smoothing parameter of 0.97 and epsilon of 1e-6. When distilling from the teacher policy, we used a learning rate of 1e-4 and multiplied it by 0.97 every epoch for both the Mining and Playground domains. For fine-tuning, we used a learning rate of 2.5e-6 for the Playground domain and 2e-7 for the Mining domain. For actor-critic training of NSGS, we used

Appendix E Details of AND/OR Operation and Approximated AND/OR Operation
In Section 4.2, the outputs of the AND and OR nodes in the subtask graph were defined using AND and OR operations with multiple inputs. They can be represented in logical expressions as below:
y_j^{\mathrm{AND}} = \bigwedge_{x \in X_j} x  (26)

y_j^{\mathrm{OR}} = \bigvee_{x \in X_j} x  (27)
where the x's are the elements of the set X_j, and X_j is the set of inputs coming from the children nodes of the j-th node. These AND and OR operations are then smoothed as below:
(28)  
(29) 
where \sigma is the sigmoid function and the remaining quantities are hyperparameters to be set; we used different hyperparameter values for the Mining and Playground domains.

Appendix F Details of Subtask Executor
Architecture
The subtask executor has the same architecture as the parameterized skill architecture of Oh et al. [2017] with slightly different hyperparameters. The network consists of Conv1(32x3x3-1/1)-Conv2(32x3x3-1/1)-Conv3(32x1x1-1/0)-Conv4(32x3x3-1/1)-LSTM(256)-FC(256). The subtask executor takes the two task parameters as additional input to compute the subtask embedding, which is further linearly transformed into the weights of Conv3 and the (factorized) weights of the LSTM through multiplicative interaction, as in Oh et al. [2017]. Finally, the network has three fully-connected output layers for the actions, the termination probability, and the baseline, respectively.
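The multiplicative interaction can be sketched as an embedding-conditioned, factorized weight matrix; the shapes and the exact factorization below are illustrative assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, in_dim, out_dim, rank = 16, 32, 32, 8

# factor matrices for a task-conditioned weight W(e) = U diag(V e) P
U = rng.standard_normal((out_dim, rank)) * 0.1
V = rng.standard_normal((rank, emb_dim)) * 0.1
P = rng.standard_normal((rank, in_dim)) * 0.1

def task_conditioned_weight(task_embedding):
    """Factorized multiplicative interaction: the task embedding scales
    the rank-r factors, yielding a per-task weight matrix."""
    return U @ np.diag(V @ task_embedding) @ P

e = rng.standard_normal(emb_dim)
W = task_conditioned_weight(e)  # shape (out_dim, in_dim)
```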
Learning objective
The subtask executor is trained through policy distillation and then fine-tuned. Similar to Oh et al. [2017], we first trained 16 teacher policy networks, one for each subtask. The teacher policy network consists of Conv1(16x3x3-1/1)-BN1(16)-Conv2(16x3x3-1/1)-BN2(16)-Conv3(16x3x3-1/1)-BN3(16)-LSTM(128)-FC(128). Similar to the subtask executor network, the teacher policy network has three fully-connected output layers for the actions, the termination probability, and the baseline, respectively. The learned teacher policy networks are then used as teachers for policy distillation to train the subtask executor. During policy distillation, we train the agent to minimize the following objective function:
(30) 
where \theta is the parameter of the subtask executor network, \pi_{\theta, g} is the simplified notation of the subtask executor given input subtask g, \pi^{T}_{g} is the simplified notation of the teacher policy for subtask g, a cross-entropy loss is used for predicting termination over the set of states in which the subtask terminates, and \beta_t is the termination probability output. After policy distillation, we fine-tuned the subtask executor using an actor-critic method with generalized advantage estimation (GAE):
\nabla_\theta \mathcal{L}(\theta) = -\,\mathbb{E}\left[ \nabla_\theta \log \pi_{\theta}(a_t \mid s_t, g)\, \hat{A}_t \right], \qquad \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}  (31)
where \gamma is a discount factor and \lambda is a weight for balancing between bias and variance of the advantage estimation; the hyperparameter values differed between policy distillation and fine-tuning.
Appendix G Details of LSTM Baseline
Architecture
The LSTM baseline consists of an LSTM on top of a CNN. The CNN architecture is the same as that of the observation module of NSGS described in Section C, and the LSTM architecture is the same as that used in the subtask executor described in Section F. Specifically, the network consists of BN1-Conv1(16x1x1-1/0)-BN2-Conv2(32x3x3-1/1)-BN3-Conv3(64x3x3-1/1)-BN4-Conv4(96x3x3-1/1)-BN5-Conv5(128x3x3-1/1)-BN6-Conv6(64x1x1-1/0)-LSTM(256)-FC(256). The CNN takes the observation tensor as input and outputs an embedding. The embedding is then concatenated with the other input vectors, including the subtask completion vector, the eligibility vector, and the number of remaining steps. Finally, the LSTM takes the concatenated vector as input and outputs the softmax policy.
Learning objective
The LSTM baseline was trained using an actor-critic method. For the baseline, we found that a moving average of the return works much better than learning a critic network, and used it in our experiments. This is due to the characteristics of the adaptation setting: the subtask graph is fixed and the agent is trained for only a small number of episodes, so a critic network is usually underfitted. Similar to NSGS, the learning objective is given as
(32) 
where \gamma is a discount factor, \lambda is a weight for balancing between bias and variance of the advantage estimation, and the baseline is the moving average of the return at time step t.
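The moving-average return baseline can be sketched as follows; the decay value and variable names are assumptions:

```python
def update_baseline(baseline, episode_return, decay=0.95):
    """Exponential moving average of the episode return; cheap to maintain
    and avoids fitting a critic on only a handful of episodes."""
    if baseline is None:
        return episode_return
    return decay * baseline + (1 - decay) * episode_return

baseline = None
advantages = []
for ret in [1.0, 2.0, 0.5]:
    b = baseline if baseline is not None else 0.0
    advantages.append(ret - b)        # advantage = return - baseline
    baseline = update_baseline(baseline, ret)
```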
Appendix H Details of Search Algorithms
Each iteration of the Monte-Carlo tree search (MCTS) method consists of four stages: selection, expansion, rollout, and backpropagation.

Selection: We used the UCB criterion [Auer et al., 2002]. Specifically, the option with the highest value of the following score is chosen for selection:

\frac{R_i}{N_i} + c \sqrt{\frac{2 \ln N}{N_i}}  (33)

where R_i is the accumulated return at the i-th node, N_i is the number of visits to the i-th node, c is the exploration-exploitation balancing weight, and N is the number of total iterations so far. We tuned c and used the best-performing value for the MCTS, MCTS+GRProp and MCTS+NSGS methods.
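The selection rule can be sketched as follows; the value of c and the infinite score for unvisited nodes (so every option is tried once) are assumptions:

```python
import math

def ucb_score(total_return, visits, total_iterations, c=1.0):
    """UCB criterion (Auer et al., 2002): mean return plus an
    exploration bonus that shrinks as a node is visited more often."""
    if visits == 0:
        return math.inf
    return total_return / visits + c * math.sqrt(
        2.0 * math.log(total_iterations) / visits)

# per-option statistics: (accumulated return, visit count)
stats = {"A": (3.0, 4), "B": (1.0, 1), "C": (0.0, 0)}
best = max(stats, key=lambda k: ucb_score(*stats[k], total_iterations=5))
```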

Expansion: MCTS randomly chooses among the remaining eligible subtasks, while the subtask is chosen by the NSGS policy for the MCTS+NSGS method and by the GRProp policy for the MCTS+GRProp method. More specifically, MCTS+NSGS and MCTS+GRProp choose greedily among the remaining subtasks based on the NSGS and GRProp policies, respectively. Due to the memory limit, the expansion of the search tree was truncated at a depth of 7 for the Playground domain and 10 for the Mining domain, and rollout was performed beyond the maximum depth.

Rollout: MCTS randomly executes an eligible subtask, while MCTS+NSGS and MCTS+GRProp execute the subtask with the highest probability given by NSGS and GRProp policies, respectively.

Backpropagation: Once the episode terminates, the result is backpropagated: the accumulated return R_i and the visit count N_i are updated for the nodes in the tree that the agent visited within the episode, and the total iteration count is updated as N <- N + 1.
Appendix I Details of Environment
I.1 Mining
There are 15 types of objects: Mountain, Water, Work space, Furnace, Tree, Stone, Grass, Pig, Coal, Iron, Silver, Gold, Diamond, Jeweler’s shop, and Lumber shop. The agent can take 10 primitive actions: up, down, left, right, pickup, use1, use2, use3, use4, and use5; the agent cannot move onto Mountain and Water cells. Pickup removes the object under the agent, and the use actions do not change the observation. There are 26 subtasks in the Mining domain:

Get wood/stone/string/pork/coal/iron/silver/gold/diamond: The agent should go to Tree/Stone/Grass/Pig/Coal/Iron/Silver/Gold/Diamond respectively, and take pickup action.

Make firewood/stick/arrow/bow: The agent should go to Lumber shop and take use1/use2/use3/use4 action respectively.

Light furnace: The agent should go to Furnace and take use1 action.

Smelt iron/silver/gold: The agent should go to Furnace and take use2/use3/use4 action respectively.

Make stonepickaxe/ironpickaxe/silverware/goldware/bracelet: The agent should go to Work space and take use1/use2/use3/use4/use5 action respectively.

Make earrings/ring/necklace: The agent should go to Jeweler’s shop and take use1/use2/use3 action respectively.
The icons used in Mining domain were downloaded from www.icons8.com and www.flaticon.com. The Diamond and Furnace icons were made by Freepik from www.flaticon.com.
I.2 Playground
There are 10 types of objects: Cow, Milk, Duck, Egg, Diamond, Heart, Box, Meat, Block, and Ice. The Cow and Duck move by one pixel in a random direction with probability 0.1 and 0.2, respectively. The agent can take 6 primitive actions: up, down, left, right, pickup, and transform; the agent cannot move onto Block cells. Pickup removes the object under the agent, and transform changes the object under the agent into Ice. The subtask graphs were randomly generated without any hand-coded template (see Section J for details).
Appendix J Details of Subtask Graph Generation
J.1 Mining Domain
The precondition of each subtask in the Mining domain was defined as in Figure 10. Based on this graph, we generated all possible subgraphs by removing subtask nodes that have no parent node, while always keeping subtasks A, B, D, E, F, G, H, I, K, and L. The reward of each subtask was randomly scaled.
J.2 Playground Domain
Nodes:
- number of tasks in each layer
- number of distractors in each layer
- number of AND nodes in each layer
- reward of subtasks in each layer

Edges:
- number of children of each AND node in each layer
- number of children of each AND node with a NOT connection in each layer
- number of parents with a NOT connection of distractors in each layer
- number of children of each OR node in each layer

Episode:
- number of steps given for each episode
For generating training and test samples, the subtask graph structure was defined in terms of the parameters in Table 3. To cover a wide range of subtask graphs, we randomly sampled some of the parameters from the ranges specified in Tables 4 and 6, while the remaining parameters were set manually. We prevented the graph from including duplicated AND nodes with the same children node(s). We carefully set the range of each parameter such that at least 500 different subtask graphs can be generated from the given parameter ranges. Table 4 summarizes the parameters used to generate training and evaluation subtask graphs for the Playground domain.
Train (=D1):
- number of tasks in each layer: {6,4,2,1}
- number of distractors in each layer: {2,1,0,0}
- number of AND nodes in each layer: {3,3,2}–{5,4,2}
- number of children of AND node: {1,1,1}–{3,3,3}
- number of children of AND node with NOT connection: {0,0,0}–{2,2,1}
- number of parents with NOT connection of distractors: {0,0,0}–{3,3,0}
- number of children of OR node: {1,1,1}–{2,2,2}
- reward of subtasks in each layer: {0.1,0.3,0.7,1.8}–{0.2,0.4,0.9,2.0}
- number of steps per episode: 48–72

D2:
- number of tasks in each layer: {7,5,2,1}
- number of distractors in each layer: {2,2,0,0}
- number of AND nodes in each layer: {4,3,2}–{5,4,2}
- number of children of AND node: {1,1,1}–{3,3,3}
- number of children of AND node with NOT connection: {0,0,0}–{2,2,1}
- number of parents with NOT connection of distractors: {0,0,0,0}–{3,3,0,0}
- number of children of OR node: {1,1,1}–{2,2,2}
- reward of subtasks in each layer: {0.1,0.3,0.7,1.8}–{0.2,0.4,0.9,2.0}
- number of steps per episode: 52–78

D3:
- number of tasks in each layer: {5,4,4,2,1}
- number of distractors in each layer: {1,1,1,0,0}
- number of AND nodes in each layer: {3,3,3,2}–{5,4,4,2}
- number of children of AND node: {1,1,1,1}–{3,3,3,3}
- number of children of AND node with NOT connection: {0,0,0,0}–{2,2,1,1}
- number of parents with NOT connection of distractors: {0,0,0,0,0}–{3,3,3,0,0}
- number of children of OR node: {1,1,1,1}–{2,2,2,2}
- reward of subtasks in each layer: {0.1,0.3,0.6,1.0,2.0}–{0.2,0.4,0.7,1.2,2.2}
- number of steps per episode: 56–84

D4:
- number of tasks in each layer: {4,3,3,3,2,1}
- number of distractors in each layer: {0,0,0,0,0,0}
- number of AND nodes in each layer: {3,3,3,3,2}–{5,4,4,4,2}
- number of children of AND node: {1,1,1,1,1}–{3,3,3,3,3}
- number of children of AND node with NOT connection: {0,0,0,0,0}–{2,2,1,1,0}
- number of parents with NOT connection of distractors: {0,0,0,0,0,0}–{0,0,0,0,0,0}
- number of children of OR node: {1,1,1,1,1}–{2,2,2,2,2}
- reward of subtasks in each layer: {0.1,0.3,0.6,1.0,1.4,2.4}–{0.2,0.4,0.7,1.2,1.6,2.6}
- number of steps per episode: 56–84
Appendix K Ablation Study on Neural Subtask Graph Solver Agent
K.1 Learning without Pre-training
Zero-Shot Performance

Method              | Playground D1 | D2   | D3   | D4   | Mining Eval
NSGS (Ours)         | .820          | .785 | .715 | .527 | 8.19
NSGS-task (Ours)    | .773          | .730 | .645 | .387 | 6.51
GRProp (Ours)       | .721          | .682 | .623 | .424 | 6.16
NSGS-scratch (Ours) | .046          | .056 | .062 | .106 | 3.68
Random              | 0             | 0    | 0    | 0    | 2.79
We implemented the NSGS-scratch agent, which is trained with the actor-critic method from scratch without pre-training from the GRProp policy, to show that pre-training plays a crucial role in training our NSGS agent. Table 5 summarizes the result. NSGS-scratch performs much worse than NSGS, suggesting that pre-training is important for training NSGS. This is not surprising, as our problem is combinatorially intractable (e.g., searching over the optimal sequence of subtasks given an unseen subtask graph).
K.2 Ablation Study on the Balance between Task and Observation Module
We implemented the NSGS-task agent, which uses only the task module without the observation module, to compare the contributions of the task module and the observation module of the NSGS agent. Overall, our NSGS agent outperforms the NSGS-task agent, showing that the observation module improves the performance by a large margin.
Appendix L Experiment Result on Subtask Graph Features
To investigate how agents deal with different types of subtask graph components, we evaluated all agents on the following types of subtask graphs:


‘Base’ set consists of subtask graphs with AND and OR operations, but without NOT operation.

‘Base-OR’ set removes all the OR operations from the base set.

‘Base+Distractor’ set adds several distractor subtasks to the base set.

‘Base+NOT’ set adds several NOT operations to the base set.

‘Base+NegDistractor’ set adds several negative distractor subtasks to the base set.

‘Base+Delayed’ set assigns zero reward to all subtasks but the toplayer subtask.
Note that we further divided the distractor set into Distractor and NegDistractor. A distractor subtask is a subtask without any parent node in the subtask graph; executing it may give an immediate reward but is suboptimal in the long run. A negative-distractor subtask is a subtask with only NOT connections (at least one) to its parent nodes; executing it may give an immediate reward, but makes other subtasks non-executable. Table 6 summarizes the detailed parameters used for generating the subtask graphs. The results are shown in Figure 11. Since the ‘Base’ and ‘Base-OR’ sets do not contain the NOT operation and every subtask gives a positive reward, the greedy baseline performs reasonably well on them compared to the other sets of subtask graphs. The gap between NSGS and GRProp is also relatively large on these two sets, because computing the optimal ordering between subtasks matters more for these kinds of subtask graphs: since only NSGS can take the cost of each subtask into account from the observation, it finds a better sequence of subtasks more often.
In the ‘Base+Distractor’, ‘Base+NOT’, and ‘Base+NegDistractor’ cases, it is more important for the agent to carefully find and execute subtasks that have a positive long-term effect while avoiding distractors that are not helpful for executing future subtasks. In these tasks, the greedy baseline tends to execute distractors very often because it cannot, in principle, consider the long-term effect of each subtask. On the other hand, our GRProp naturally screens out distractors, which receive zero or negative gradient during reward backpropagation. Similarly, GRProp performs well on the ‘Base+Delayed’ set because it receives non-zero gradients for all subtasks that are connected to the final rewarding subtask. Since our NSGS was distilled from GRProp, it handles delayed reward and distractors as well as (or better than) GRProp.
Base:
- number of tasks in each layer: {4,3,2,1}
- number of distractors in each layer: {0,0,0,0}
- number of AND nodes in each layer: {3,3,2}–{4,3,3}
- number of children of AND node: {1,1,2}–{3,2,2}
- number of children of AND node with NOT connection: {0,0,0}–{0,0,0}
- number of parents with NOT connection of distractors: {0,0,0,0}–{0,0,0,0}
- number of children of OR node: {1,1,1}–{2,2,2}
- number of steps per episode: 40–60

Base-OR: number of children of OR node: {1,1,1}–{1,1,1}
Base+Distractor: number of distractors in each layer: {2,1,0,0}
Base+NOT: number of children of AND node with NOT connection: {0,0,0}–{3,2,2}
Base+NegDistractor: number of distractors in each layer: {2,1,0,0}