Control What You Can: Intrinsically Motivated Task-Planning Agent

06/19/2019 ∙ by Sebastian Blaes, et al. ∙ Max Planck Society

We present a novel intrinsically motivated agent that learns how to control its environment as quickly as possible by optimizing learning progress. It learns what can be controlled, how to allocate time and attention, and the relations between objects using surprise-based motivation. The effectiveness of our method is demonstrated in a synthetic and a robotic manipulation environment, where it yields considerably improved performance and lower sample complexity. In a nutshell, our work combines several task-level planning agent structures (backtracking search on task graph, probabilistic road-maps, allocation of search efforts) with intrinsic motivation to achieve learning from scratch.




1 Introduction

This paper studies how to make an autonomous agent learn to gain maximal control of its environment under little external reward. To answer this question, we turn to the true learning experts: children. How do they solve this problem? They play, often with any objects within their reach. The purpose may not be immediately clear to us. But to play is to manipulate, to gain control. In the spirit of this cognitive developmental process, we specifically design an agent that is 1) intrinsically motivated to gain control of the environment and 2) capable of learning its own curriculum and of reasoning about object relations in a way that was not done before.

As a motivating example, consider an environment with a heavy object that cannot be moved without using a tool such as a forklift, as depicted in Fig. 1 (environment). The agent needs to be able to control itself and the tool, and use the tool to move the heavy object. In the beginning, we do not assume the agent has knowledge of the tool, object, or physics. It needs to learn everything from scratch, which is highly challenging for current algorithms. Without external rewards, an agent may be driven by intrinsic motivation (IM) to gain control over its own internal representation of the world, which includes itself and objects in the environment. It often faces a decision of what to attempt to learn with limited time and attention: if there are several objects that can be manipulated, which one should be dealt with first? In our approach, the scheduling is solved by an automatic curriculum that aims at improving learning progress. Learning progress may have a unique advantage over other quantities such as prediction error (curiosity): it renders unsolvable tasks uninteresting as soon as progress stalls.

Instead of an end-to-end architecture, we adopt a core reasoning structure about tasks and subgoals. Inspired by the task-level planning methods from the robotics and AI planning communities, we model the agent using a planning architecture in the form of chained subtasks. In practice, this is modeled as a task graph as in Fig. 1 (task graph). In order to perform manipulation, the agent and the tool need to be in a specific relation. Our agent learns such relationships by an attention mechanism bootstrapped by surprise detection.

Our main contributions are:

  1. We propose to use maximizing controllability and surprise as intrinsic motivation for solving challenging control problems. The computational effectiveness of this cognitive development-inspired approach is empirically demonstrated.

  2. We propose to adopt several task-level planning ideas (backtracking search on task graph/goal regression, probabilistic road-maps, allocation of search efforts) for designing IM agents to achieve task completion and skill acquisition from scratch.

To our knowledge, no prior IM study has adopted similar controllability and task-planning insights. The contributions are validated through 1) a synthetic environment, with exhaustive analysis and ablation, that cannot be solved by state-of-the-art methods even with oracle rewards; 2) a robotic manipulation environment where tool manipulation is necessary.

2 Related Work

In this section, we survey recent computational approaches to intrinsic motivation (IM). The survey is by no means comprehensive due to the large body of literature on this topic. Generally speaking, there are a few types of IM in the literature: learning progress (competence, empowerment), curiosity (surprise, prediction error), self-play (adversarial generation), auxiliary tasks, maximizing information theoretical quantities, etc. To help readers clearly understand the relation between our work and the literature, we provide the following table.

Method | Reference | Intrinsic motivation | Computational methods
CWYC | Ours | learning progress + surprise | task-level planning, relational attention
h-DQN | kulkarni2016hierarchical | reaching subgoals | HRL, DQN
IMGEP | forestier2017intrinsically | learning progress | memory-based
CURIOUS | Colas2018:CURIOUS | learning progress | DDPG, HER, E-UVFA
SAC-X | riedmiller2018learning | auxiliary task | HRL, (DDPG-like) PI
Relational RL | zambaldi2018relational | - | relation net, IMPALA
ICM | pathak2017curiosity | prediction error | A3C, ICM
Goal GAN | florensa2018automatic | adversarial goal | GAN, TRPO
Asymmetric self-play | sukhbaatar2017intrinsic | self-play | Alice/Bob, TRPO, REINFORCE

Learning progress describes the rate at which an agent gains competence in certain skills. It is a heuristic for measuring interest, inspired by observing human children. This is the focus of many recent studies (schmidhuber1991possibility; Kaplan2004Maximizing-Learning-Progress:; Schmidhuber:06cs; Oudeyer2007Intrinsic-Motivation-Systems; BaranesOudeyer2013:ActiveGoalExploration; forestier2017intrinsically; Colas2018:CURIOUS). Our work can be thought of as an instantiation using maximizing controllability and a task-planning structure. Empowerment klyubin2005empowerment proposes a quantity that measures the control of the agent over its future sensory input. Curiosity, as a form of IM, is usually modeled as the prediction error of the agent’s world model. For example, it can lead to remarkable success in challenging video game domains pathak2017curiosity or be used to learn options chentanez2005intrinsically. Self-play as IM, where two agents engage in an adversarial game, was demonstrated to increase learning speed sukhbaatar2017intrinsic. This is also related to the idea of using GANs for goal generation as in florensa2018automatic.

Recently, auxiliary prediction tasks were used to aid representation learning jaderberg2016reinforcement. In comparison, our goal is not to train the feature representation but to study the developmental process; our concept of controllability is also unique. Similarly, informed auxiliary tasks as a form of IM were considered in riedmiller2018learning. Many RL tasks can be formulated as aiming to maximize certain information theoretical quantities sunyi2011agi; Little2013Learning-and-exploration-in-action-perception; ZahediMartiusAy2013; haarnoja2017reinforcement. In contrast, we focus on IM inspired by human children. In kulkarni2016hierarchical, a list of given subgoals is scheduled, whereas our attention model/goal generation is learned. Our work is closely related to zambaldi2018relational, where multi-head self-attention is used to learn non-local relations between entities, which are then fed as input into an actor-critic network. In our work, learning these relations is separated from learning the policy and done by a low-capacity network. More details are given in Sec. 3. In addition, we learn an automatic curriculum.

Task-level planning has been extensively studied in the robotics and AI planning communities in the form of geometric planning methods (e.g., RRT (lavalle1998rapidly), PRM (kavraki1994probabilistic)) and optimization-based planning siciliano2016springer; posa2014direct. There is a parallel between the sparse reward problem and optimization-based planning: the lack of gradient information if the robot is not in contact with the object of interest. Notably, our use of the surprise signal is reminiscent of the event-triggered control design heemels2012introduction; baumann2018deep in the control community and was also proposed in the cognitive sciences butzStructures.

3 Method

We call our method Control What You Can (CWYC). The goal is to make an agent learn to control itself and objects in its environment, or more generically, to control the components of its internal representation. We assume the observable state-space is partitioned into groups of potentially controllable components (coordinates) referred to as goal spaces. Manipulating these components is formulated as self-imposed tasks. These can be the agent’s position (task: move agent to location ()), object positions (task: manipulate object to position ()), etc. The component’s semantics and whether it is controllable are unknown to the agent. Obtaining such a representation from raw sensor data is orthogonal to our investigation. Readers interested in this representation learning are referred to, e. g. , pere2018unsupervised . Formally, the coordinates in the state corresponding to each goal-reaching task are specified by and denoted by . The goal in each task is denoted as . For instance, if task has its goal-space along the coordinates and (e.g. agent’s location) then and are corresponding state values.

During the learning/development phase, the agent can decide which task (e.g., itself or an object) it attempts to control. Intuitively, it should be beneficial for the learning algorithm to concentrate on tasks in which the agent can make progress and to infer potential task dependencies. In order to express the capability of controlling a certain object, we consider goal-reaching tasks with randomly selected goals. When the agent can reach any goal allowed by the environment, it has achieved control of the component (e.g., moving a box to any desired location). In many challenging scenarios, e.g., object manipulation, there are “funnel states” that must be discovered. For instance, in a tool-use task the funnel states are where the agent picks up the tool and where the tool touches another object that needs to be manipulated. Our architecture embeds relational learning in an overall intrinsically motivated learning framework based on a learned probabilistic graph that chains low-level goal-directed RL controllers.

Our approach contains several components as illustrated in Fig. 1. Their detailed interplay is as follows. The tasks control groups of components (coordinates) of the state. A task selector (bandit) is used to select a self-imposed task (final task) maximizing expected learning progress. Given a final task, the task planner computes a viable subtask sequence (bold) from a learned task graph. The subgoal generators (relational attention networks) continuously create goals in each subtask. The goal-conditioned low-level policies for each task control the agent in the environment. We refer to the combination of task planner, task graph, and subgoal generators as the acting policy (which internally uses the current subtask policy, goal, etc.). After one rollout, different quantities measuring the training progress are computed and stored in the per-task history buffer. An intrinsic motivation module computes the rewards and target signals for the task selector, the task planner, and the subgoal generators based on learning progress and prediction errors. All components are trained concurrently and without external supervision. Prior knowledge enters only in the form of specifying the goal spaces (groups of coordinates of the state space). The environment allows the agent to select which task to do next and generates a random arrangement with a random goal.

3.1 Intrinsic motivation

In general, our agent is motivated to learn as fast as possible, i.e., to have the highest possible learning progress, and to be as successful as possible in each task. When performing a particular task with a given goal, the agent computes the reward for the low-level controller as the negative distance to the goal and declares success when this distance is within a precision threshold (a binary indicator, written with the Iverson bracket). We calculate the following key measures to quantify intrinsic motivations:

Success rate (controllability): the expected success in a task under the state distribution induced by the agent's policy. In practice, it is estimated as a running mean over the most recent attempts of the task.

Learning progress: the time derivative of the success rate, quantifying whether the agent gets better at the task compared to earlier attempts.
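The reward, success test, and the two competence measures above can be illustrated with a minimal sketch. The Euclidean distance, the `precision` and `window` values, and all names (`goal_reward`, `is_success`, `CompetenceTracker`) are assumptions made for illustration and are not taken from the paper.

```python
from collections import deque

import numpy as np


def goal_reward(state_coords, goal):
    """Negative distance to the goal (Euclidean norm is an assumption)."""
    diff = np.asarray(state_coords, dtype=float) - np.asarray(goal, dtype=float)
    return -float(np.linalg.norm(diff))


def is_success(state_coords, goal, precision=0.05):
    """Iverson-bracket style indicator: 1 if within `precision`, else 0.

    `precision` is a hypothetical threshold value.
    """
    return int(-goal_reward(state_coords, goal) <= precision)


class CompetenceTracker:
    """Tracks success rate and learning progress for a single task."""

    def __init__(self, window=10):
        # `window` (number of recent attempts) is a hypothetical setting.
        self.outcomes = deque(maxlen=window)
        self.success_rate = 0.0
        self.learning_progress = 0.0

    def update(self, success):
        """Record one attempt and refresh both measures."""
        self.outcomes.append(int(success))
        new_rate = sum(self.outcomes) / len(self.outcomes)
        # Finite difference of the success rate between subsequent attempts.
        self.learning_progress = new_rate - self.success_rate
        self.success_rate = new_rate
        return self.success_rate, self.learning_progress
```

One tracker per task would be updated after every rollout that attempts that task; a task whose success rate stays flat then yields zero learning progress, whether it is already solved or unsolvable.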

Initially, any success signals might be so sparse that learning becomes slow because of uninformed exploration. Hence, we employ surprise as a proxy that guides the agent’s attention to tasks and states that might be interesting.

Prediction error: the error, measured in the goal space of a task, of a forward model trained using squared loss.

Surprising events: a binary signal that is 1 if the prediction error in a task exceeds a confidence interval (computed over the history) and 0 otherwise, see also Gumbsch2017:Eventtaxonomies.

To understand why surprising events can be informative, let us consider our example again. Assume the agent only knows how to move itself. It will move around and will not be able to manipulate other parts of its state space, i.e., it can move neither the heavy box nor the tool. Whenever it accidentally hits the tool, the tool moves and creates a surprise signal in the coordinates of the tool task. Thus, this particular situation is likely a good starting point for solving the tool task and for further exploration.

3.2 Task-planning architecture

The task selector, Fig. 1 (bandit), models the learning progress when attempting to solve a task. It is implemented as a multi-armed bandit. While no learning progress is available, the surprise signal is used as a proxy. Thus, the internal reward signal for the bandit for a rollout attempting task is


with . The multi-armed bandit chooses the (final) task for a rollout using a stochastic policy. More details can be found in Sec. A.1. In our setup, the corresponding goal within this task is determined randomly by the environment.

Because difficult tasks require subtasks to be performed in a certain order, a task planner determines the sequence of subtasks. The task planner models how well and how quickly (sub)task can be solved when performing subtask directly before it. As before, we use surprising events as a proxy signal for potential future success. The value of each task transition is captured by , where and with representing the “start”:


where denotes a running average and is the runtime for solving task by doing task before (the maximum number of time steps if not successful). Similarly to Eq. 1, this quantity is initially dominated by the surprise signals and later by the actual success values.

The matrix represents the adjacency matrix of the task graph, see Fig. 1 (task graph). It is used to construct a sequence of subtasks by starting from the final task and determining the previous subtask with an ε-greedy policy using . This is then repeated for the next (prerequisite) subtask, until (start) is sampled (no loops are allowed), see also Fig. 1 (task planner and task graph).
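The backward construction of the subtask sequence can be sketched as follows. This is a minimal illustration under stated assumptions: the transition values are stored in a NumPy array `B` where `B[i, j]` scores doing task `j` directly before task `i`, index 0 plays the role of the "start" node, and the names `plan_subtasks`, `START`, and `epsilon` are hypothetical.

```python
import numpy as np

START = 0  # hypothetical index of the "start" node; real tasks are 1..K


def plan_subtasks(B, final_task, epsilon=0.1, rng=None):
    """Build a subtask chain backwards from `final_task`.

    B[i, j] is the learned value of executing task j directly before
    task i (column 0 means: start task i right away).
    Returns the chain in execution order, ending with `final_task`.
    """
    rng = np.random.default_rng() if rng is None else rng
    chain = [final_task]
    current = final_task
    while True:
        # No loops: tasks already in the chain cannot be chosen again.
        candidates = [j for j in range(B.shape[1]) if j not in chain]
        if rng.random() < epsilon:
            prev = int(rng.choice(candidates))  # explore
        else:
            prev = max(candidates, key=lambda j: B[current, j])  # exploit
        if prev == START:
            break
        chain.append(prev)
        current = prev
    return chain[::-1]
```

Because the start node is never added to the chain, it always remains a candidate, so the loop terminates after at most as many steps as there are tasks.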

Each (sub)task is itself a goal-reaching problem. In order to decide which subgoals need to be chosen, we employ an attention network for each task transition, i.e., for the transition from task to task . As before, the aim of the goal proposal network is to maximize the success rate of solving task when using the proposed goal in task before. In the example, in order to pick up the tool, the goal of the preceding locomotion task should be the location of the tool. An attention network that can learn relations between observations is required. We use an architecture that models local pairwise distance relationships. It associates a value/attention to each point in the goal-space of the preceding task as a function of the state : (omitting index )


where , , , and are trainable parameters. The network is trained using square-loss with the following target signal :


for all that occurred during task where is if the switching state from task to task occurred in state and zero otherwise. To get an intuition about the parametrization, consider a particular pair of coordinates , say agent’s and tool’s -coordinate. The model can express with that both have to be at distance zero for to be . However, with the system can also model offsets, global reference points and other relationships. Further details on the architecture and training can be found in Suppl. A.5. We observe that the goal proposal network can learn a relationship after a few examples (on the order of 10), possibly due to the restricted model class. The goal proposal network can be thought of as a relational network santoro2017simple, albeit one that is easier to train. Sampling a goal from the network is done by computing the maximum analytically as detailed in Suppl. A.3. The low-level control in each task has its own policy learned by soft actor critic (SAC) haarnojaEtAlLevine2018:SAC or DDPG+HER andrychowicz2018hindsight.
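To make the idea of a pairwise-distance attention model concrete, the following sketch evaluates one possible functional form: an exponential of negative weighted squared pairwise residuals, with two weights and one offset per coordinate pair plus a global scale. This form is an assumption chosen to match the description above (four trainable parameter groups, zero-distance relations, offsets); it is not claimed to be the paper's Eq. 3, and the names `attention_value`, `w1`, `w2`, `w3`, and `gamma` are hypothetical.

```python
import numpy as np


def attention_value(s, w1, w2, w3, gamma):
    """Assumed form: exp(-gamma * sum_{k<l} (w1[k,l]*s[k] + w2[k,l]*s[l] + w3[k,l])**2).

    s      : state vector of length n
    w1, w2 : (n, n) pairwise weights (only entries with k < l are used)
    w3     : (n, n) pairwise offsets
    gamma  : positive scale
    """
    s = np.asarray(s, dtype=float)
    k, l = np.triu_indices(s.shape[0], k=1)  # all coordinate pairs k < l
    residual = w1[k, l] * s[k] + w2[k, l] * s[l] + w3[k, l]
    return float(np.exp(-gamma * np.sum(residual ** 2)))


# State layout (hypothetical): [agent_x, agent_y, tool_x, tool_y].
n = 4
w1, w2, w3 = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
w1[0, 2], w2[0, 2] = 1.0, -1.0  # agent_x must equal tool_x
w1[1, 3], w2[1, 3] = 1.0, -1.0  # agent_y must equal tool_y

on_tool = attention_value([0.3, 0.7, 0.3, 0.7], w1, w2, w3, gamma=1.0)
off_tool = attention_value([0.0, 0.0, 1.0, 0.0], w1, w2, w3, gamma=1.0)
```

With opposite-signed weights on a coordinate pair, the value peaks where the two coordinates coincide, which is the kind of relation needed for the locomotion goal to land on the tool; a nonzero offset entry would shift that peak.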

Figure 2: Environments used to test CWYC. (a) basic tool-use/object manipulation environment; (b) robotic object manipulation environment. The hook needs to be used to move the box.

4 Experimental Results

Through experiments in two different environments, we wish to investigate empirically: does the CWYC agent learn efficiently to gain control over the environment? What about challenging tasks that require a sequence of subtasks and uncontrollable objects? How is the behavior of CWYC different from that of other (H)RL agents? To give readers a sense of the computational properties of CWYC, we use an implementation of HIRO nachum2018data as a baseline, which is suitable for continuous control tasks. However, it solves each task independently as it does not support the multi-task setting. In addition, we show baselines that use only the low-level controllers (SAC haarnojaEtAlLevine2018:SAC or HER andrychowicz2018hindsight) for each individual task independently and spend resources on all tasks with equal probability.

We also add CWYC with a hand-crafted oracle task planner and oracle subgoal generator, denoted as oracle, see Suppl. D.1. The code as well as the environment implementations will be made public with the final version of this paper. The pseudocode is provided in Suppl. B.

Synthetic environment. The synthetic object manipulation arena, as shown in Fig. 2(a), consists of a point mass agent with two degrees of freedom and several objects surrounded by a wall. It is implemented in the MuJoCo physics simulator (Todorov2012:Mujoco) and has continuous state and action spaces. To make the tasks difficult, we consider the case with 4 different objects: 1. the tool, which can be picked up easily; 2. the heavy object, which needs the tool to be moved; 3. an unreliable object, denoted as 50% object, which does not respond to control during 50% of the rollouts; and 4. a random object that moves around randomly and cannot be manipulated by the agent, see Fig. 2(a). Details of the physics in this environment can be found in Suppl. C.1.

Figure 3: Competence of the agents (CWYC w oracle, CWYC, HIRO, SAC) in controlling all aspects of the synthetic environment. (a) Overall performance (maximal ). (b-e) Individual task competence. HIRO and SAC can only learn the locomotion task. All performance plots (here and in the remaining figures) show the median and shaded 25%/75% percentiles over 10 random seeds. In (c-e) the green curve is below the blue curve. (f) Reachability analysis of the heavy object, showing the gain in reachability: the probability of reaching the point with the heavy box from 20 random starting states (initially zero).

Figure 3 shows the performance of the CWYC agent compared to the hierarchical baseline (HIRO), the non-hierarchical baseline (SAC) and the hand-crafted upper baseline (oracle). The main measure is competence, i.e., the overall success rate () of controlling the internal state, i.e., reaching a random goal in each task space. In this setting an average maximum of can be achieved, due to the “random object” and “50% object”. The results show that our method is able to quickly gain control over the environment, which is also illustrated by the reachability, which grows over time and reaches almost full coverage for the heavy object. After steps, the agent can control what is controllable. The SAC and HIRO baselines attempt to solve each task independently and spend resources equally between tasks. Both only succeed in the locomotion task. They do not learn to pick up any of the other objects and transport them to a desired location. As a remark, the arena is relatively large, such that random encounters are not likely. Providing oracle reward signals makes the baselines (HIRO/SAC) learn to control the tool eventually, but still significantly slower than CWYC, see Fig. 8(d), and the heavy object remains uncontrollable, see Suppl. D.2.

Robotic manipulation. The robotic manipulation environment consists of a robotic arm with a gripper (3 + 1 DOF) in front of a table with a hook and a box (at random locations), see Fig. 2(b). The box cannot be reached by the gripper directly. Instead, the robot has to use the hook to manipulate the box. The observed state space is 40-dimensional. The environment is based on the OpenAI Gym (1606.01540) robotics environment. The goal-spaces/tasks are defined as (1) reaching a target position with the gripper, (2) manipulating the hook, and (3) manipulating the box. Further details can be found in Suppl. C.2. Compared to the synthetic environment, object relations are much less obvious in this environment, especially the ones involving the hook because of its asymmetrical shape. This makes learning object relations much harder. For instance, while trying to grasp the hook, the gripper might touch the hook at the wrong position and thus fail at manipulation. However, the objects are relatively close to each other, leading to more frequent random manipulations. The results are shown in Fig. 4. Asymptotically, both CWYC and the HER baseline manage to solve all three tasks almost perfectly. The other baselines cannot solve them. Regarding the time required to learn the tasks, our method shows a clear advantage over the HER baseline, solving the 2nd and 3rd task 25% faster.


Figure 4: Success rates of reaching, tool using and object manipulation in the robotic environment. CWYC as well as DDPG+HER learn all three tasks perfectly. CWYC has improved sample complexity. SAC and HIRO learn the reaching task only slowly and the other two tasks not at all.

5 Analysis and ablation: why and how CWYC works

How does the agent gain control of the environment? We start by investigating how the surprising events help to identify the funnel states/relationships, a critical part of our architecture. When the agent, for instance, involuntarily bumps into a tool, the latter will suddenly move, causing a large prediction error in the forward models in the tool goal-space, see Fig. 5(a,b). Only a few such surprise observations are needed to make the subgoal generators, Fig. 1 (subgoal generators), effective, see Fig. 5(c). A more detailed analysis follows below. For further details on the training process, see Suppl. A.5.

Figure 5: Surprise and relational funnel state learning in the synthetic experiment. (a) Trajectory: the agent bumps into the tool. (b) Prediction error: surprise (prediction error above confidence level) marks the funnel state where the tool’s and the agent’s positions coincide. (c) Relation learning: average distance of the generated goals (locomotion targets) from the tool location as a function of the number of surprising events.

Figure 6: Resource allocation and task planning structure in the synthetic environment. (a) Resource allocation (Fig. 1, bandit): relative time spent on each task in order to maximize learning progress. (b) Task planner (Fig. 1, task planner): the probabilities of selecting task (column) before task (row). Self-loops (gray) are not permitted. Every subtask sequence begins in the start state. (c) Learned task graph (Fig. 1, task graph) derived from (b). The arrows point to the preceding task, which corresponds to the planning direction. The red states show an example plan for moving the heavy box.

Resource allocation is managed by the task selector, Fig. 1 (bandit), based on maximizing learning progress and surprise, see Eq. 1. As shown in Fig. 6(a), starting from a uniform task selection, the agent quickly spends most of its time on learning locomotion, because it is the task in which the agent makes the most progress, cf. Fig. 3. After locomotion has been learned well enough, the agent starts to concentrate on new tasks that require the locomotion skill (moving the tool and the “50% object”). Afterwards, the heavy object becomes controllable due to the competence in the tool task (at about steps). The agent automatically shifts its attention to that.

The task selector produces the expected result that simple tasks are solved first and stop getting attention as soon as they cannot be improved more than other tasks. This is in contrast to approaches that are solely based on curiosity/prediction error. When all tasks are controllable (progress plateaus) the 50% object attracts most attention due to randomness in the success rate. As a remark, the learned resource allocation of the oracle agent is similar to that of CWYC.

Next, we study how the agent understands the task structure. The funnel states, discovered above, need to be visited frequently in order to collect data on the indirectly controllable parts of the environment (e.g., tool and heavy box). The natural dependencies between tasks are learned by the task planner, Fig. 1 (task planner). Initially, the dependencies between the subtasks are unknown, resulting in a probability of selecting a certain preceding subtask (or “start”). After learning, the CWYC agent has found which task needs to be executed before, see Fig. 6(b-c). When executing a plan, subgoals have to be generated. This is where the relational funnel states learned by the subgoal generators (Fig. 1, subgoal generators) come in. The subgoal generators learn initially from surprising events and attempt to learn the relation among the components of the observation vector. For instance, every time the tool is moved, the agent’s location is close to that of the tool.

Figure 7: Subgoal proposal networks and the learned object relationships (panels: initial; locomotion → tool; tool → heavy object). Intuitively, a nonzero value (red) indicates that this relationship is used. For clarity, each panel shows (see Eq. 3) of the relevant goal proposal networks. Example situations with the corresponding generated goal are presented on the right. The locomotion and tool task goals, i.e., the red and purple flags, are generated by the subgoal generators, Fig. 1 (subgoal generators); the object goal, i.e., the orange flag, is sampled randomly.

Figure 7 displays the learned relationships for the subgoal generation for the locomotion → tool transition and for the tool → heavy object transition. A non-zero value indicates that the corresponding components are involved in the relationship. The full parametrization is visualized and explained in Suppl. A.5. The system identifies that for the tool task the coordinates of the agent and the tool have to coincide. Likewise, for the heavy box, the agent, tool, and heavy box have to be in one spot. The goal proposal network updates the current goal every 5 steps by computing the goal with the maximal value, see Suppl. A.5 for more details.

We ablate different components of our architecture to demonstrate their impact on the performance. We remove the surprise detection, indicated as CWYC, and remove the resource allocation (uniform task sampling), denoted as CWYC. Figure 8(a) shows the performance and reveals that the surprise signal is a critical part of the machinery. If it is removed, the performance drops to that of the SAC baseline, i.e., the agent only solves the locomotion task. Figure 8(b,c) provides insight into why this is happening. Without the surprise signal, the goal proposal network does not get enough positive training data to learn from; hence, it constantly samples random goals, which prevents the successful switches that would create additional training data. As expected, the resource allocation speeds up learning and makes learning the hard tasks faster, cf. CWYC and CWYC.

Figure 8: (a) Performance comparison of ablated versions of CWYC: uniform task selection () and without surprise signal (). (b-c) number of positive training samples for goal network and quality of sampled goals (see Fig. 5) CWYC vs. CWYC. (d) baselines with oracle reward on tool task.

6 Conclusion

We present the Control What You Can (CWYC) method that makes an autonomous agent learn to control the components of its environment effectively. We adopt a task-planning agent architecture while all components are learned from scratch. Driven by learning progress, the IM agent learns an automatic curriculum which allows it to neither invest resources in uncontrollable objects nor try disproportionately often to improve its performance on tasks that are not fully solvable. This key feature differentiates CWYC from approaches solely based on curiosity.


Appendix A Details of the method

A.1 Final task selector


The task selector (Fig. 1, bandit) models the learning progress when attempting to solve a task and is implemented as a multi-armed bandit. The reward is given in Eq. 1. We use the absolute value of the learning progress because the system should learn both when it can improve and when performance degrades [2]. Initially, the surprise term dominates the quantity. As soon as actual progress can be made, the learning-progress term takes the leading role. The reward is non-stationary and the action-value is updated according to


with learning rate . The task selector chooses the (final) task for each rollout according to the tasks' values. To maintain exploration, we opt for a stochastic policy with .
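A minimal sketch of such a bandit is given below. The incremental update with a constant learning rate is the standard choice for non-stationary rewards; the reward combination in `bandit_reward` (absolute learning progress plus a small weighted surprise term), the proportional sampling rule, and the names `TaskSelector`, `beta`, and `floor` are assumptions made for illustration rather than the paper's exact definitions.

```python
import numpy as np


def bandit_reward(learning_progress, mean_surprise, beta=0.1):
    """Assumed reward: |learning progress| plus a weighted surprise proxy.

    `beta` is a hypothetical weight keeping the surprise term small.
    """
    return abs(learning_progress) + beta * mean_surprise


class TaskSelector:
    """Multi-armed bandit over tasks with a non-stationary value estimate."""

    def __init__(self, n_tasks, lr=0.1, rng=None):
        self.q = np.zeros(n_tasks)  # one action-value per task
        self.lr = lr
        self.rng = np.random.default_rng() if rng is None else rng

    def update(self, task, reward):
        # Constant learning rate: recent rewards weigh more than old ones.
        self.q[task] += self.lr * (reward - self.q[task])

    def probabilities(self, floor=1e-3):
        # Assumed stochastic policy: sample proportionally to the values;
        # `floor` keeps every task selectable to maintain exploration.
        weights = np.maximum(self.q, 0.0) + floor
        return weights / weights.sum()

    def sample(self):
        return int(self.rng.choice(len(self.q), p=self.probabilities()))
```

After each rollout, `update` would be called with the reward of the attempted task, and `sample` would pick the final task for the next rollout.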

A.2 Low-level control

Each task has its own policy which is trained separately using an off-policy deep RL algorithm. We use soft actor critic (SAC) [12] in the synthetic environment and DDPG+HER [1] in the robotics environment. Policies and the critic networks are parametrized by the goal (UVFA [29]).

A.3 Subgoal sampling

For each subtask the goal is selected with the maximal value in the attention map. However, coordinates of tasks that are still to be solved in the task-chain are fixed, because they likely cannot be controlled by the current policy. Formally:


where is the task-chain and denotes all tasks after and including . selects the coordinates belonging to task , see Sec. 3.1. The goal for subtask is then . This is a convex program and its solution can be computed analytically.

a.4 Intrinsic motivations

For computing the success rate we use a running mean of the last attempts of the particular task:


where denotes the success in the -th last rollout where task was attempted to be solved.

The learning progress is then given as the finite difference of between subsequent attempts of task .
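The success rate and learning progress described above can be sketched as follows. The class name `ProgressTracker` and the handling of the first few attempts (averaging over however many outcomes exist so far) are assumptions made for this illustration; the default window of 100 mirrors the "avg. window size" entry in Appendix F.

```python
from collections import deque


class ProgressTracker:
    """Running success rate and learning progress for one task (sketch)."""

    def __init__(self, window_size=100):
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)
        self.success_rate = 0.0
        self.learning_progress = 0.0

    def record(self, success):
        previous = self.success_rate
        self.outcomes.append(1.0 if success else 0.0)
        # Running mean over the stored attempts (assumption: fewer than
        # `window_size` attempts are averaged over what is available).
        self.success_rate = sum(self.outcomes) / len(self.outcomes)
        # Finite difference between subsequent attempts of the task.
        self.learning_progress = self.success_rate - previous
        return self.learning_progress
```

One tracker would be kept per task, and the absolute value of `learning_progress` would feed the task selector's reward.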

To compute the surprise signal , we compute the statistics of the prediction error over all the collected experience, i. e. we assume


and compute the empirical and . Denoting the finite difference by , surprise within one rollout is then defined as



is a hyperparameter that needs to be chosen.
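Since the defining equation is not reproduced here, the following Python sketch shows only one plausible reading of the surprise computation. It assumes that a rollout counts as surprising when the absolute finite difference of the prediction error exceeds the empirical mean plus a multiple `theta` of the empirical standard deviation at any step. The function name, the parameter name `theta`, and the exact thresholding rule are all assumptions of this sketch.

```python
import statistics


def rollout_surprise(rollout_errors, history_errors, theta=5.0):
    """Return 1 if the rollout contains a surprising jump, else 0 (sketch)."""
    # Empirical statistics of the prediction error over collected experience.
    mu = statistics.mean(history_errors)
    sigma = statistics.pstdev(history_errors)
    threshold = mu + theta * sigma  # assumed form of the threshold

    # Finite differences of the prediction error within the rollout.
    pairs = zip(rollout_errors, rollout_errors[1:])
    jumps = [abs(curr - prev) for prev, curr in pairs]
    return 1 if any(jump > threshold for jump in jumps) else 0
```

A larger `theta` makes the signal more conservative, so that only rare, large jumps in prediction error register as surprise.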

a.5 Training details of the goal proposal network

In an ever-changing environment such as the ones presented in this paper, the goal proposal networks are a critical component of our framework; they aim to learn relations between entities in the world. Transitions observed in the environment are labeled by the agent as either interesting or undetermined. Interesting transitions are those in which a surprising event (high prediction error) occurs or which lead to a success in task given some other task was solved before, see Eq. 4. All other transitions are labeled as undetermined, since they might contain transitions that are similar to those labeled interesting but did not spark high interest. Coming back to our running example: bumping into the tool, and hence suddenly moving it, might spark interest in the tool because of a sudden jump in prediction error. In general, the behaviour of an object after the surprising event is unknown and the label for these transitions is not clear. Consequently, we discard all undetermined transitions within a rollout that come after a transition with a positive label.
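The discarding rule above can be illustrated with a short Python sketch. The function name `keep_indices` and the encoding of labels as 1 (interesting) and 0 (undetermined) are assumptions made for this example.

```python
def keep_indices(labels):
    """Indices of transitions kept for training within one rollout (sketch).

    `labels` holds 1 for interesting and 0 for undetermined transitions.
    """
    kept = []
    seen_positive = False
    for index, label in enumerate(labels):
        if label == 1:
            # Interesting transitions are always kept.
            kept.append(index)
            seen_positive = True
        elif not seen_positive:
            # Undetermined transitions are kept only before the first
            # positive label; later ones have an unclear label.
            kept.append(index)
    return kept
```

For a rollout labeled `[0, 0, 1, 0, 1, 0]`, the sketch keeps indices 0, 1, 2 and 4 and drops the undetermined transitions at indices 3 and 5.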

After removing all data that might prevent the goal proposal networks from learning the right relations, the problem remains that positive events are rare compared to the massive body of undetermined data. Hence, we balance the training data in each batch during training.
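One simple way to balance batches is sketched below. The equal split between the two classes and sampling with replacement are assumptions of this illustration (the text only states that batches are balanced), and the function name `balanced_batch` is hypothetical.

```python
import random


def balanced_batch(positives, undetermined, batch_size, rng):
    """Draw a batch with equally many positive and undetermined samples."""
    half = batch_size // 2
    # Sampling with replacement lets a handful of rare positive samples
    # fill their half of the batch.
    batch = [(sample, 1) for sample in rng.choices(positives, k=half)]
    batch += [
        (sample, 0)
        for sample in rng.choices(undetermined, k=batch_size - half)
    ]
    rng.shuffle(batch)
    return batch
```

For example, `balanced_batch(pos, und, 64, random.Random(0))` yields 32 positive and 32 undetermined samples regardless of how imbalanced the two pools are.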

To make efficient use of the few positive samples collected at the beginning of training, we impose a structural prior on the goal proposal network, given by Eq. 3. The weight matrices are depicted in Fig. 9. This particular structure restricts the hypothesis space of the component to positional relations between components in the observation space that contains entities in the environment. In the main text, Figure 7 shows a compact representation of the initial and final weight matrices for different tasks that are computed by taking the minimum over (left column) and (middle column) in Fig. 9.


Figure 9: Weights learned by goal proposal networks for different task transitions (rows: Locomotion → Tool and Tool → Heavy object). The left column shows the weights of , the middle column of and the right column of (see Eq. 3).

To understand the parametrization, suppose we want to model that two components of should have the same value for a positive signal. Then should be nonzero and . In this case the corresponding term in the exponent of Eq. 4 is zero if . We see that for the learned in Fig. 9 this relationship holds for the relevant components (position of agent, tool and object).

a.6 Training / overall procedure

All components of CWYC start in a completely uninformed state. A rollout starts by randomly scrambling the environment. The (final) task is chosen by the task selector. The task planner constructs the task chain . Every 5 steps in the environment, the goal proposal network computes a goal for the current task. Given the subgoal, the goal-parametrized policy of that task is used. Whenever the goal is reached (up to a certain precision), a switch to the next task occurs. Again, the goal proposal network is employed to select a goal in this task, unless it is the final task, in which case the final goal is used. If a goal cannot be reached, the task ends after steps. In practice we run 5 rollouts in parallel. Then all components are trained using the collected data. For the task selector and task planner we use Eq. 5 and Eq. 2, respectively. The forward model and the goal proposal networks are trained using squared loss and Adam [17]. The policies are trained according to SAC/DDPG+HER. Pseudocode and implementation details can be found in Appendices B and F.

Appendix B Pseudocode

The pseudocode for the method is given in Algorithm 1.

for episode in episodes do
    sample main task
    sample main goal from environment
    compute task chain using starting from
     // contains list of task indices
    while  and no success in  do
        if  then
            sample goal from  // Eq. 6
        end if
        try to reach with policy
        if succ then
             // next task in task chain
        end if
    end while
    store episode in history buffer
    calculate statistics based on history
    train policies for each task
    train B // Sec. 3.2
    train all G // Sec. 3.2
    train // Sec. 3.2
end for
Algorithm 1 CWYC
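The inner loop of one rollout can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the function `run_rollout`, the callables `propose_goal` and `attempt_task`, and the parameter `replan_every` are hypothetical names, and the sketch assumes that `attempt_task` performs a single environment step and reports whether the current goal was reached.

```python
def run_rollout(task_chain, final_goal, propose_goal, attempt_task,
                max_steps, replan_every=5):
    """Walk through a task chain until the final task succeeds (sketch).

    task_chain: list of task indices, ending with the final task.
    propose_goal(task, next_task): stands in for a goal proposal network.
    attempt_task(task, goal): one policy step; True if the goal is reached.
    Returns (success, number_of_steps_used).
    """
    position = 0
    goal = None
    for step in range(max_steps):
        task = task_chain[position]
        is_final = position == len(task_chain) - 1
        if is_final:
            # The final task uses the goal sampled from the environment.
            goal = final_goal
        elif goal is None or step % replan_every == 0:
            # Subgoals are refreshed periodically (assumed schedule).
            goal = propose_goal(task, task_chain[position + 1])
        if attempt_task(task, goal):
            if is_final:
                return True, step + 1
            position += 1  # switch to the next task in the chain
            goal = None
    return False, max_steps
```

After each rollout, the collected episode would be stored and all components (policies, task planner, goal proposal networks, task selector, forward model) trained on the history, which this sketch leaves out.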

Appendix C Environments

c.1 Synthetic environment

The synthetic environment is depicted in Fig. 2a and is simulated by the physics engine MuJoCo. The agent is modeled by a ball that is controlled by applying force along the x and y axes, so the agent’s action corresponds to a 2-dimensional vector:


The motion of the agent is subject to the laws of motion with the application of friction from the environment which makes it non-trivial to control. Other than the agent, the environment contains objects with different dynamics. The positions of the objects are part of the observation space of the agent along with a flag that specifies if the object has been picked up by the agent. We are dealing with a fully observable environment.

We define the goal spaces of the tasks as corresponding to the position of the individual objects. Some objects are harder to move than others and have other objects as dependencies. This means that the agent has to find this relation between them in order to successfully master the environment.

The types of objects that are used in the experiments are the following:

  • Static objects cannot be moved

  • Random objects move randomly in the environment, but cannot be moved by the agent

  • 50% light objects can be moved in 50% of the rollouts

  • Tool can be moved and used to move the heavy object

  • Heavy objects can be moved when using the tool

The observation vector for objects is structured as follows , where is the position of the agent, is the position of the -th object and indicates whether the agent is in possession of the -th object. The goal spaces are the coordinates of the agent and the coordinates of each object .
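To make the relation between observation and goal spaces concrete, the following Python sketch splits an observation into its parts. The exact ordering of the entries is an assumption of this example (agent position first, then the 2D positions of all objects, then one possession flag per object), as is the function name `split_observation`.

```python
def split_observation(observation, num_objects):
    """Split a flat observation into agent, object and flag parts (sketch).

    Assumed layout: [agent_x, agent_y, obj1_x, obj1_y, ..., flag_1, ...].
    """
    expected = 2 + 3 * num_objects
    if len(observation) != expected:
        raise ValueError(f"expected {expected} entries, got {len(observation)}")

    # Goal space of the locomotion task: the agent coordinates.
    agent = tuple(observation[0:2])
    # Goal space of each object task: that object's coordinates.
    objects = [
        tuple(observation[2 + 2 * i: 4 + 2 * i]) for i in range(num_objects)
    ]
    # One flag per object: whether the agent currently holds it.
    offset = 2 + 2 * num_objects
    flags = [bool(observation[offset + i]) for i in range(num_objects)]
    return agent, objects, flags
```

Under this layout, each task's goal space is simply a pair of coordinates selected from the observation.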

c.2 Robotic environment

The robotic environment is depicted in Fig. 2c. The state space is 40-dimensional. It consists of the agent position and velocity, the gripper state, the absolute and relative positions of the box and the hook, respectively, as well as their velocities and rotations.

The environment is based on the OpenAI gym [4] PickAndPlace-v1 environment.

Final goals for the reach and tool tasks are sampled close to the initial gripper and tool locations, respectively. Final goals for the object task are spawned in close proximity to the initial box position, such that the box needs to be pulled closer to the robot but never pushed away. The box is spawned close to the upper end of the hook, on the side nearer to the robot.

Appendix D Oracle baselines

d.1 CWYC with oracle goals

To assess the maximum performance of CWYC in the described settings, we crafted an upper baseline in which all learned high-level components, except for the final task selector, are fixed and set to their optimal values.

In the distractor setting, every task is solved by first doing the locomotion task. The goal proposal network always returns the state value , reflecting the ground truth relation we try to learn.

In the synthetic tool-use setting, the task graph depicted in Figure 10 is used. The goal proposal network always returns the state value , reflecting the ground truth relation we try to learn.

Figure 10: Oracle dependency graph.

d.2 HIRO/SAC with oracle reward

To see if HIRO manages to solve the synthetic environment at all, we constructed an oracle version of HIRO. The oracle receives as input not only the distance from, e.g., tool to target position but additionally the distance from agent to tool. This signal is rich enough to allow HIRO to solve the tool manipulation task as shown in Fig. 8(d) in the main text, although it still takes considerably longer than CWYC. We trained the SAC baseline on the same hybrid reward as well.

Appendix E Additional analysis of the ablation studies

Without the surprise signal, CWYC learns neither a meaningful resource allocation schedule, see Fig. 11(a), nor a task dependency graph, see Fig. 11(b). This highlights again the critical role of the surprise signal.

Figure 11: (a) Resource allocation and (b) task dependency graph for the ablated version of CWYC. In (a) all tasks except locomotion behave identically because no progress is made.

Appendix F Training Details and Parameters

f.1 Synthetic environment

  • Training:

    # parallel rollout workers: 5
  • Environment:

    arena size:
    : 1600
    : 1.0
  • SAC:

    batch size: 64
    policy type: gaussian
    discount: 0.99
    reward scale: 5
    target update interval: 1
    tau (soft update)
    action prior: uniform
    layer size (): 256
    # layers (): 2
    # train iterations: 200
    buffer size:
  • Forward model:

    batch size: 64
    confidence interval: 5
    network type: MLP
    layer size: 100
    : 5
    # layers: 9
    # train iterations: 100
  • Final task selector:

    random_eps: 0.05
    surprise history weighting: 0.99
  • Task planner:

    avg. window size: 100
    surprise history weighting: 0.99
    sampling_eps: 0.05
  • Goal proposal network:

    batch size: 64
    L1 reg.: 0.0
    L2 reg.: 0.0
    init: 1.0
    trainable: True
    # train iterations: 100

f.2 Robotic environment

  • Training:

    # parallel rollout workers: 5
  • Environment:

    : 150
    : 0.05
  • DDPG+HER:

    batch size: 256
    layer size (): 256
    # layers (): 3
    # train iterations: 80
    buffer size:
    action_l2: 1.0
    relative goals: false
    replay strategy: future
    replay_k: 4
    random_eps: 0.3
    noise_eps: 0.2
  • Forward model:

    batch size: 64
    confidence interval: 3
    network type: MLP
    layer size: 100
    : 3
    # layers: 9
    # train iterations: 100
  • Final task selector:

    random_eps: 0.05
    surprise history weighting: 0.99
  • Task planner:

    avg. window size: 100
    surprise history weighting: 0.99
    sampling_eps: 0.05
  • Goal proposal network:

    batch size: 64
    L1 reg.: 0.0
    L2 reg.: 0.0
    init: 1.0
    trainable: True
    # train iterations: 30