This paper studies how to make an autonomous agent learn to gain maximal control of its environment under little external reward. To answer this question, we turn to the true learning experts: children. How do they solve this problem? They play, often with any objects within their reach. The purpose may not be immediately clear to us. But to play is to manipulate, to gain control. In the spirit of this cognitive developmental process, we design an agent that is (1) intrinsically motivated to gain control of the environment and (2) capable of learning its own curriculum and of reasoning about object relations.
As a motivating example, consider an environment with a heavy object that cannot be moved without using a tool such as a forklift, as depicted in Fig. 1 (environment). The agent needs to be able to control itself and the tool, and to use the tool to move the heavy object. In the beginning, we do not assume the agent has knowledge of the tool, the object, or the physics. It needs to learn everything from scratch, which is highly challenging for current algorithms. Without external rewards, an agent may be driven by intrinsic motivation (IM) to gain control over its own internal representation of the world, which includes itself and objects in the environment. It often faces the decision of what to attempt to learn with limited time and attention: if there are several objects that can be manipulated, which one should be dealt with first? In our approach, this scheduling is solved by an automatic curriculum that aims at improving learning progress. Learning progress may have a unique advantage over other quantities such as prediction error (curiosity): it renders unsolvable tasks uninteresting as soon as progress stalls.
Instead of an end-to-end architecture, we adopt a core reasoning structure about tasks and subgoals. Inspired by the task-level planning methods from the robotics and AI planning communities, we model the agent using a planning architecture in the form of chained subtasks. In practice, this is modeled as a task graph, as in Fig. 1 (task graph). In order to perform manipulation, the agent and the tool need to be in a specific relation. Our agent learns such relationships with an attention mechanism bootstrapped by surprise detection.
Our main contributions are:
- We propose to use maximizing controllability and surprise as intrinsic motivation for solving challenging control problems. The computational effectiveness of this cognitive-development-inspired approach is empirically demonstrated.
- We propose to adopt several task-level planning ideas (backtracking search on the task graph/goal regression, probabilistic road-maps, allocation of search efforts) for designing IM agents to achieve task completion and skill acquisition from scratch.
To our knowledge, no prior IM study has adopted similar controllability and task-planning insights. The contributions are validated through 1) a synthetic environment, with exhaustive analysis and ablation, that cannot be solved by state-of-the-art methods even with oracle rewards; 2) a robotic manipulation environment where tool manipulation is necessary.
2 Related Work
In this section, we survey recent computational approaches to intrinsic motivation (IM). The survey is by no means comprehensive, given the large body of literature on this topic. Generally speaking, there are a few types of IM in the literature: learning progress (competence, empowerment), curiosity (surprise, prediction error), self-play (adversarial generation), auxiliary tasks, maximizing information-theoretic quantities, etc. To clarify the relation between our work and the literature, we provide the following table.
| Method | Intrinsic motivation | Computational methods |
|---|---|---|
| CWYC (ours) | learning progress + surprise | task-level planning, relational attention |
| h-DQN (kulkarni2016hierarchical) | reaching subgoals | HRL, DQN |
| IMGEP (forestier2017intrinsically) | learning progress | memory-based |
| CURIOUS (Colas2018:CURIOUS) | learning progress | DDPG, HER, E-UVFA |
| SAC-X (riedmiller2018learning) | auxiliary task | HRL, (DDPG-like) PI |
| Relational RL (zambaldi2018relational) | - | relation net, IMPALA |
| ICM (pathak2017curiosity) | prediction error | A3C, ICM |
| Goal GAN (florensa2018automatic) | adversarial goal | GAN, TRPO |
| Asymmetric self-play (sukhbaatar2017intrinsic) | self-play | Alice/Bob, TRPO, REINFORCE |
Learning progress describes the rate of change of the agent’s competence in certain skills. It is a heuristic for measuring interest, inspired by observing human children. This is the focus of many recent studies (schmidhuber1991possibility; Kaplan2004Maximizing-Learning-Progress:; Schmidhuber:06cs; Oudeyer2007Intrinsic-Motivation-Systems; BaranesOudeyer2013:ActiveGoalExploration; forestier2017intrinsically; Colas2018:CURIOUS). Our work can be thought of as an instantiation that uses maximizing controllability and a task-planning structure. Empowerment (klyubin2005empowerment) is a quantity that measures the control of the agent over its future sensory input. Curiosity, as a form of IM, is usually modeled as the prediction error of the agent’s world model. For example, it can lead to remarkable success in challenging video game domains (pathak2017curiosity) or be used to learn options (chentanez2005intrinsically). Self-play as IM, where two agents engage in an adversarial game, was demonstrated to increase learning speed (sukhbaatar2017intrinsic). This is also related to the idea of using GANs for goal generation, as in florensa2018automatic.
Recently, auxiliary prediction tasks were used to aid representation learning (jaderberg2016reinforcement). In comparison, our goal is not to train the feature representation but to study the developmental process; our concept of controllability is also unique. Similarly, informed auxiliary tasks as a form of IM were considered in riedmiller2018learning. Many RL tasks can be formulated as aiming to maximize certain information-theoretic quantities (sunyi2011agi; Little2013Learning-and-exploration-in-action-perception; ZahediMartiusAy2013; haarnoja2017reinforcement). In contrast, we focus on IM inspired by human children. In kulkarni2016hierarchical, a list of given subgoals is scheduled, whereas our attention model/goal generation is learned. Our work is closely related to zambaldi2018relational, where multi-head self-attention is used to learn non-local relations between entities, which are then fed as input into an actor-critic network. In our work, learning these relations is separated from learning the policy and is done by a low-capacity network. More details are given in Sec. 3. In addition, we learn an automatic curriculum.
Task-level planning has been extensively studied in the robotics and AI planning communities in the form of geometric planning methods (e.g., RRT (lavalle1998rapidly), PRM (kavraki1994probabilistic)) and optimization-based planning (siciliano2016springer; posa2014direct). There is a parallel between the sparse reward problem and optimization-based planning: the lack of gradient information if the robot is not in contact with the object of interest. Notably, our use of the surprise signal is reminiscent of the event-triggered control design (heemels2012introduction; baumann2018deep) in the control community and was also proposed in the cognitive sciences (butzStructures).
3 Method
We call our method Control What You Can (CWYC). The goal is to make an agent learn to control itself and objects in its environment, or more generically, to control the components of its internal representation. We assume the observable state space is partitioned into groups of potentially controllable components (coordinates), referred to as goal spaces. Manipulating these components is formulated as self-imposed tasks. These can be the agent’s position (task: move the agent to a given location), object positions (task: move an object to a given position), etc. The semantics of each component and whether it is controllable are unknown to the agent. Obtaining such a representation from raw sensor data is orthogonal to our investigation. Readers interested in this kind of representation learning are referred to, e.g., pere2018unsupervised. Formally, each goal-reaching task specifies the state coordinates that form its goal space, and the goal of the task is a target value for these coordinates. For instance, if a task has its goal space along two coordinates (e.g. the agent’s location), then the corresponding state values are the entries of the state at those two coordinates.
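The partition of the state into goal spaces can be illustrated with a small sketch. The coordinate layout, the task names, and the helper `goal_space_values` are hypothetical and chosen only for illustration; the text above does not fix a concrete layout.

```python
import numpy as np

# Hypothetical layout: each task owns a group of state coordinates (its goal space).
GOAL_SPACES = {
    "locomotion": [0, 1],    # assumed: agent x, y
    "tool": [2, 3],          # assumed: tool x, y
    "heavy_object": [4, 5],  # assumed: heavy object x, y
}


def goal_space_values(state, task):
    """Return the entries of `state` that belong to the goal space of `task`."""
    return np.asarray(state, dtype=float)[GOAL_SPACES[task]]


state = np.array([0.1, 0.2, 1.0, -1.0, 3.0, 3.5])
agent_xy = goal_space_values(state, "locomotion")
tool_xy = goal_space_values(state, "tool")
```

A goal for a task is then simply a target vector of the same length as the task's coordinate group.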
During the learning/development phase, the agent can decide which task (e.g. itself or an object) it attempts to control. Intuitively, it should be beneficial for the learning algorithm to concentrate on tasks where the agent can make progress and on inferring potential task dependencies. In order to express the capability of controlling a certain object, we consider goal-reaching tasks with randomly selected goals. When the agent can reach any goal allowed by the environment, it has achieved control of the component (e.g. moving a box to any desired location). In many challenging scenarios, e.g. object manipulation, there are “funnel states” that must be discovered. For instance, in a tool-use task the funnel states are where the agent picks up the tool and where the tool touches another object that needs to be manipulated. Our architecture embeds relational learning in an overall intrinsically motivated learning framework based on a learned probabilistic graph that chains low-level goal-directed RL controllers.
Our approach contains several components, as illustrated in Fig. 1. Their detailed interplay is as follows. The tasks control groups of components (coordinates) of the state. A task selector (bandit) is used to select a self-imposed task (the final task), maximizing expected learning progress. Given a final task, the task planner computes a viable subtask sequence (shown in bold in Fig. 1) from a learned task graph. The subgoal generators (relational attention networks) continuously create goals in each subtask. The goal-conditioned low-level policies for each task control the agent in the environment. We group the task planner, the task graph, the subgoal generators and the low-level policies into the acting policy (which internally uses the current subtask policy, goal, etc.). After each rollout, different quantities measuring the training progress are computed and stored in the per-task history buffer. An intrinsic motivation module computes the rewards and target signals for the task selector, the task planner and the subgoal generators based on learning progress and prediction errors. All components are trained concurrently and without external supervision. Prior knowledge enters only in the form of specifying the goal spaces (groups of coordinates of the state space). The environment allows the agent to select which task to do next and generates a random arrangement with a random goal.
3.1 Intrinsic motivation
In general, our agent is motivated to learn as fast as possible, i.e. to have the highest possible learning progress, and to be as successful as possible in each task. When performing a particular task with a given goal, the agent computes the reward for the low-level controller as the negative distance to the goal, and declares success when this distance is within a precision threshold (expressed with the Iverson bracket). We calculate the following key measures to quantify intrinsic motivations:
- Success rate (controllability): the expected success under the state distribution induced by the acting policy. In practice, it is estimated as a running mean over the most recent attempts of the task.
- Learning progress: the time derivative of the success rate, quantifying whether the agent gets better at a task compared to earlier attempts.
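The reward, success indicator, success rate and learning progress described above can be sketched as follows. This is a minimal illustration under stated assumptions: the window length, the finite-difference estimate of the derivative, and all names (`reward`, `success`, `TaskStats`) are hypothetical, and the success check is applied to a single state rather than to a whole rollout.

```python
from collections import deque

import numpy as np


def reward(goal_coords, goal):
    """Low-level reward: negative Euclidean distance to the goal."""
    diff = np.asarray(goal_coords, dtype=float) - np.asarray(goal, dtype=float)
    return -float(np.linalg.norm(diff))


def success(goal_coords, goal, precision):
    """Iverson-bracket style indicator: 1 if within `precision` of the goal, else 0."""
    return int(-reward(goal_coords, goal) <= precision)


class TaskStats:
    """Tracks success rate and learning progress for one task."""

    def __init__(self, window=10):
        self.outcomes = deque(maxlen=window)   # most recent attempts (0 or 1)
        self.rates = deque(maxlen=window + 1)  # history of running-mean success rates

    def record(self, succeeded):
        self.outcomes.append(int(succeeded))
        self.rates.append(self.success_rate())

    def success_rate(self):
        """Running mean of the stored attempts (0.0 before any attempt)."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def learning_progress(self):
        """Assumed estimator: finite difference of the success rate per rollout."""
        if len(self.rates) < 2:
            return 0.0
        return (self.rates[-1] - self.rates[0]) / (len(self.rates) - 1)


stats = TaskStats(window=4)
for outcome in [0, 0, 1, 1]:
    stats.record(outcome)
```

In this toy run the success rate rises from 0 to 0.5 over four rollouts, so the estimated learning progress is positive; a task that stays at a constant success rate would yield zero progress regardless of how high that rate is.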
Initially, any success signals might be so sparse that learning becomes slow because of uninformed exploration. Hence, we employ surprise as a proxy that guides the agent’s attention to tasks and states that might be interesting.
- Prediction error: the error, in the goal space of each task, of a forward model trained with a squared loss.
- Surprising events: time steps at which the prediction error in the goal space of a task is unusually large.
To understand why surprising events can be informative, let us consider our example again. Assume the agent just knows how to move itself. It will move around and will not be able to manipulate other parts of its state space, i.e. it can move neither the heavy box nor the tool. Whenever it accidentally hits the tool, the tool moves and creates a surprise signal in the coordinates of the tool task. Thus, it is likely that this particular situation is a good starting point for solving the tool task and for further exploration.
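One way to turn prediction errors into binary surprise events is sketched below. The thresholding rule (mean plus a multiple of the standard deviation of earlier errors), the warm-up length and the function name `surprise_signal` are assumptions made for illustration; the text only states that an unexpected movement produces a surprise signal in the affected goal space.

```python
import numpy as np


def surprise_signal(errors, num_std=3.0, warmup=20):
    """Flag time steps whose prediction error is far above the error history.

    errors: 1-D sequence of per-step prediction errors in one goal space.
    Returns a 0/1 array. A step counts as surprising if its error exceeds
    mean + num_std * std of all previous errors (assumed rule).
    """
    errors = np.asarray(errors, dtype=float)
    flags = np.zeros(len(errors), dtype=int)
    for t in range(warmup, len(errors)):
        past = errors[:t]
        threshold = past.mean() + num_std * past.std()
        flags[t] = int(errors[t] > threshold)
    return flags


# Small background errors, then one large error when the tool is hit at step 30.
errors = [0.01, 0.02] * 15 + [1.0, 0.01]
flags = surprise_signal(errors)
```

Only the step with the large error is flagged here, which matches the intuition of the tool example: most steps are predictable, and the rare contact event stands out.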
3.2 Task-planning architecture
The task selector, Fig. 1, models the learning progress when attempting to solve a task. It is implemented as a multi-armed bandit. While no learning progress is available, the surprise signal is used as a proxy. Thus, the internal reward signal for the bandit for a rollout attempting a task combines the absolute value of the learning progress in that task with a surprise term (Eq. 1). The multi-armed bandit is used to choose the (final) task for a rollout using a stochastic policy. More details can be found in Sec. A.1. In our setup, the corresponding goal within this task is determined by the environment (in a random fashion).
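A bandit-style task selector along these lines might look as follows. This is a sketch under stated assumptions, not the exact rule used by CWYC: the constant step size (a standard choice for non-stationary rewards), the softmax sampling, the surprise weight and the class name `TaskSelector` are all hypothetical.

```python
import numpy as np


class TaskSelector:
    """Multi-armed bandit over tasks, rewarded by |learning progress| plus surprise."""

    def __init__(self, num_tasks, step_size=0.1, temperature=0.1,
                 surprise_weight=1.0, seed=0):
        self.q = np.zeros(num_tasks)            # action values, one per task
        self.step_size = step_size              # constant step: tracks non-stationary rewards
        self.temperature = temperature          # assumed softmax temperature
        self.surprise_weight = surprise_weight  # assumed weight of the surprise term
        self.rng = np.random.default_rng(seed)

    def bandit_reward(self, learning_progress, surprised):
        return abs(learning_progress) + self.surprise_weight * float(surprised)

    def update(self, task, learning_progress, surprised):
        r = self.bandit_reward(learning_progress, surprised)
        self.q[task] += self.step_size * (r - self.q[task])

    def probabilities(self):
        z = self.q / self.temperature
        z = z - z.max()  # shift for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def sample(self):
        """Draw the final task for the next rollout from the stochastic policy."""
        return int(self.rng.choice(len(self.q), p=self.probabilities()))


selector = TaskSelector(num_tasks=3)
uniform = selector.probabilities()
for _ in range(20):
    selector.update(task=0, learning_progress=0.2, surprised=False)
selector.update(task=1, learning_progress=0.0, surprised=True)
probs = selector.probabilities()
chosen = selector.sample()
```

With all action values at zero the selector samples tasks uniformly; a task with sustained progress or a recent surprise event then receives a larger share of rollouts, and its share decays again once progress stalls.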
Because difficult tasks require subtasks to be performed in a certain order, a task planner determines the sequence of subtasks. The task planner models how well and how quickly a (sub)task can be solved when another subtask is performed directly before it. As before, we use surprising events as a proxy signal for potential future success. The value of each task transition is captured by the entries of a matrix, in which one additional index represents the “start”.
Each entry is a running average that depends on the runtime for solving a task when a given task was performed before it (the maximum number of time steps if not successful). Similarly to Eq. 1, this quantity is initially dominated by the surprise signals and later by the actual success values.
The matrix of transition values represents the adjacency matrix of the task graph, see Fig. 1. It is used to construct a sequence of subtasks by starting from the final task and determining the previous subtask with an ε-greedy policy over the corresponding transition values. Then this is repeated for the next (prerequisite) subtask, until “start” is sampled (no loops are allowed), see also Fig. 1 (task planner and task graph).
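The backward construction of a subtask sequence can be sketched as follows. The matrix layout (row: task to be solved, column: task performed directly before it, index 0 reserved for “start”), the value of epsilon and the function name `plan_subtasks` are assumptions for illustration; learning the matrix entries themselves is not shown.

```python
import numpy as np

START = 0  # assumed convention: index 0 is the "start" pseudo-task


def plan_subtasks(B, final_task, epsilon=0.1, rng=None):
    """Chain backwards from `final_task` to START with epsilon-greedy choices.

    B[i, j] is the learned value of performing task j directly before task i.
    Returns the subtask sequence in execution order, ending with `final_task`.
    """
    rng = np.random.default_rng() if rng is None else rng
    plan = [final_task]
    current = final_task
    while True:
        # Candidates are START and every task not yet in the plan (no loops).
        candidates = [j for j in range(B.shape[1]) if j not in plan]
        if rng.random() < epsilon:
            prev = int(rng.choice(candidates))  # explore
        else:
            prev = max(candidates, key=lambda j: B[current, j])  # exploit
        if prev == START:
            break
        plan.append(prev)
        current = prev
    return plan[::-1]


# Toy task graph: 1 = locomotion, 2 = tool, 3 = heavy object.
B = np.zeros((4, 4))
B[1, START] = 1.0  # locomotion can be started directly
B[2, 1] = 1.0      # the tool task works best after locomotion
B[3, 2] = 1.0      # the heavy-object task works best after the tool task
plan = plan_subtasks(B, final_task=3, epsilon=0.0)
```

With a greedy policy on this toy matrix the plan is locomotion, then tool, then heavy object. Because START is always a candidate and every other task can appear at most once, the loop terminates for any epsilon.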
Each (sub)task is itself a goal-reaching problem. In order to decide which subgoals need to be chosen, we employ an attention network for each task transition, i.e. for the transition from one task to the next. As before, the aim of the goal proposal network is to maximize the success rate of solving the next task when using the proposed goal in the preceding task. In the example, in order to pick up the tool, the goal of the preceding locomotion task should be the location of the tool. This requires an attention network that can learn relations between observations. We use an architecture that models local pairwise distance relationships. It associates a value (attention) with each point in the goal space of the preceding task as a function of the state, using a small set of trainable parameters.
The network is trained with a squared loss on a target signal that is defined for all states that occurred during the preceding task and that marks the state in which the switch from the preceding task to the next task occurred; the corresponding indicator is zero in all other states. To get an intuition about the parametrization, consider a particular pair of coordinates, say the agent’s and the tool’s coordinate along the same axis. With a suitable choice of parameters, the model can express that both have to be at distance zero for the attention value to be maximal. However, with other parameter values the system can also model offsets, global reference points and other relationships. Further details on the architecture and training can be found in Suppl. A.5. We observe that the goal proposal network can learn a relationship after a few examples (on the order of 10), possibly due to the restricted model class. The goal proposal network can be thought of as a relational network (santoro2017simple), albeit one that is easier to train. Sampling a goal from the network is done by computing the maximum analytically, as detailed in Suppl. A.3. The low-level control in each task has its own policy, learned by soft actor-critic (SAC) (haarnojaEtAlLevine2018:SAC) or DDPG+HER (andrychowicz2018hindsight).
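To make the idea of a pairwise distance relation concrete, the sketch below uses one hypothetical parametrization that is consistent with the description above: an exponentiated negative sum of squared pairwise terms with weights `w1`, `w2`, an offset `w3` and a scale `gamma`. The exact functional form, the parameter names and the grid search over candidate goals (instead of the analytic maximization mentioned in the text) are all assumptions.

```python
import numpy as np


def attention(x, w1, w2, w3, gamma):
    """Assumed form: exp(-gamma * sum_{k,l} (w1[k,l]*x[k] + w2[k,l]*x[l] + w3[k,l])**2)."""
    x = np.asarray(x, dtype=float)
    terms = w1 * x[:, None] + w2 * x[None, :] + w3
    return float(np.exp(-gamma * np.sum(terms ** 2)))


def propose_goal(state, goal_idx, candidates, w1, w2, w3, gamma):
    """Return the candidate goal with the highest attention value (grid search)."""
    best, best_value = None, -np.inf
    for g in candidates:
        x = np.array(state, dtype=float)
        x[goal_idx] = g  # imagine the preceding task has reached goal g
        value = attention(x, w1, w2, w3, gamma)
        if value > best_value:
            best, best_value = g, value
    return best, best_value


# Toy 1-D example: state = [agent_x, tool_x]; relation "agent_x - tool_x = 0".
n = 2
w1, w2, w3 = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
w1[0, 1], w2[0, 1] = 1.0, -1.0
state = [0.0, 2.0]
best_goal, best_value = propose_goal(
    state, goal_idx=[0], candidates=[0.0, 1.0, 2.0, 3.0],
    w1=w1, w2=w2, w3=w3, gamma=1.0,
)
```

With opposite weights on the two coordinates the value peaks where agent and tool coincide, so the proposed locomotion goal is the tool position. A nonzero entry in `w3` would shift that peak by a fixed offset, and a weight on a single coordinate would tie the goal to a global reference point.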
Figure 2: (a) Tool-use/object manipulation environment. (b) Robotic object manipulation environment.
4 Experimental Results
Through experiments in two different environments, we wish to investigate empirically: does the CWYC agent learn efficiently to gain control over the environment? What about challenging tasks that require a sequence of subtasks, and uncontrollable objects? How is the behavior of CWYC different from that of other (H)RL agents? To give readers a sense of the computational properties of CWYC, we use an implementation (https://github.com/n0c1urne/hrl.git) of HIRO (nachum2018data), which is suitable for continuous control tasks, as a baseline. However, it solves each task independently, as it does not support the multi-task setting. In addition, we show baselines that use only the low-level controllers (SAC (haarnojaEtAlLevine2018:SAC) or HER (andrychowicz2018hindsight)) for each individual task independently and spend resources on all tasks with equal probability.
We also add CWYC with a hand-crafted oracle task planner and oracle subgoal generator, denoted as oracle, see Suppl. D.1. The code as well as the environment implementations will be made public with the final version of this paper. The pseudocode is provided in Suppl. B.
Synthetic environment. The synthetic object manipulation arena, as shown in Fig. 2(a), consists of a point-mass agent with two degrees of freedom and several objects surrounded by a wall. It is implemented in the MuJoCo physics simulator (Todorov2012:Mujoco) and has continuous state and action spaces. To make the tasks difficult, we consider the case with 4 different objects: 1. the tool, which can be picked up easily; 2. the heavy object, which needs the tool to be moved; 3. an unreliable object, denoted as the 50% object, which does not respond to control during 50% of the rollouts; and 4. a random object, which moves around randomly and cannot be manipulated by the agent, see Fig. 2(a). Details of the physics in this environment can be found in Suppl. C.1.
Figure 3 shows the performance of the CWYC agent compared to the hierarchical baseline (HIRO), the non-hierarchical baseline (SAC) and the hand-crafted upper baseline (oracle). The main measure is competence, i.e. the overall success rate of controlling the internal state, which means reaching a random goal in each task space. In this setting, the average success rate cannot reach 100% because of the “random object” and the “50% object”. The results show that our method is able to quickly gain control over the environment, as also illustrated by the reachability, which grows with time and reaches almost full coverage for the heavy object. After training, the agent can control what is controllable. The SAC and HIRO baselines attempt to solve each task independently and spend resources equally between tasks. Both only succeed in the locomotion task. They do not learn to pick up any of the other objects and transport them to a desired location. As a remark, the arena is relatively large, such that random encounters are unlikely. Providing oracle reward signals makes the baselines (HIRO/SAC) learn to control the tool eventually, but still significantly more slowly than CWYC, see Fig. 8(d), and the heavy object remains uncontrollable, see Suppl. D.2.
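The ceiling on the average success rate can be illustrated with a short calculation. The per-task bounds below are assumptions made for this sketch: five equally weighted tasks (locomotion plus the four objects), full control over locomotion, tool and heavy object, control over the 50% object in half of the rollouts, and no success on the random object.

```python
# Assumed per-task upper bounds on the success rate in the synthetic arena.
max_success = {
    "locomotion": 1.0,
    "tool": 1.0,
    "heavy_object": 1.0,
    "50%_object": 0.5,     # responds to control in only half of the rollouts
    "random_object": 0.0,  # assumed: cannot be steered to a requested goal
}

# Average over tasks, assuming each task contributes equally.
upper_bound = sum(max_success.values()) / len(max_success)
```

Under these assumptions the bound is 3.5 / 5 = 0.7, so an agent that controls everything controllable would plateau well below a success rate of one.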
Robotic manipulation. The robotic manipulation environment consists of a robotic arm with a gripper (3 + 1 DOF) in front of a table with a hook and a box (at random locations), see Fig. 2(b). The box cannot be reached by the gripper directly. Instead, the robot has to use the hook to manipulate the box. The observed state space is 40-dimensional. The environment is based on the OpenAI Gym (1606.01540) robotics environment. The goal spaces/tasks are defined as (1) reaching a target position with the gripper, (2) manipulating the hook, and (3) manipulating the box. Further details can be found in Suppl. C.2. Compared to the synthetic environment, object relations are much less obvious in this environment, especially the ones involving the hook because of its asymmetrical shape. This makes learning object relations much harder. For instance, while trying to grasp the hook, the gripper might touch the hook at the wrong positions and thus fail at manipulation. However, the objects are relatively close to each other, leading to more frequent random manipulations. The results are shown in Fig. 4. Asymptotically, both CWYC and the HER baseline manage to solve all three tasks almost perfectly. The other baselines cannot solve them. Regarding the time required to learn the tasks, our method shows a clear advantage over the HER baseline, solving the 2nd and 3rd task 25% times faster.
5 Analysis and ablation: why and how CWYC works
How does the agent gain control of the environment? We start by investigating how surprising events help to identify the funnel states/relationships, a critical part of our architecture. When the agent, for instance, involuntarily bumps into a tool, the latter suddenly moves, causing a large prediction error in the forward models in the tool goal space, see Fig. 5(a,b). Only a few such surprise observations are needed to make the subgoal generators (Fig. 1) effective, see Fig. 5(c). A more detailed analysis follows below. For further details on the training process, see Suppl. A.5.
Resource allocation is managed by the task selector (Fig. 1), based on maximizing learning progress and surprise, see Eq. 1. As shown in Fig. 6(a), starting from a uniform task selection, the agent quickly spends most of its time on learning locomotion, because it is the task where the agent makes the most progress, cf. Fig. 3. After locomotion has been learned well enough, the agent starts to concentrate on new tasks that require the locomotion skill (moving the tool and the “50% object”). Afterwards, the heavy object becomes controllable due to the competence in the tool task. The agent automatically shifts its attention to it.
The task selector produces the expected result that simple tasks are solved first and stop getting attention as soon as they cannot be improved more than other tasks. This is in contrast to approaches that are solely based on curiosity/prediction error. When all tasks are controllable (progress plateaus) the 50% object attracts most attention due to randomness in the success rate. As a remark, the learned resource allocation of the oracle agent is similar to that of CWYC.
Next, we study how the agent understands the task structure. The funnel states discovered above need to be visited frequently in order to collect data on the indirectly controllable parts of the environment (e.g. tool and heavy box). The natural dependencies between tasks are learned by the task planner (Fig. 1). Initially, the dependencies between the subtasks are unknown, so that every preceding subtask (or “start”) is selected with the same probability. After learning, the CWYC agent has found which task needs to be executed before which, see Fig. 6(b-c). When executing a plan, subgoals have to be generated. This is where the relational funnel states learned by the subgoal generators (Fig. 1) come in. The subgoal generators initially learn from surprising events and attempt to learn the relations among the components of the observation vector. For instance, every time the tool is moved, the agent’s location is close to that of the tool.
Figure 7 displays the learned relationships for subgoal generation for the locomotion → tool transition and for the tool → heavy object transition. A non-zero value indicates that the corresponding components are involved in the relationship. The full parametrization is visualized and explained in Suppl. A.5. The system identifies that, for the tool task, the coordinates of the agent and the tool have to coincide. Likewise, for the heavy box, the agent, the tool and the heavy box have to be in one spot. The goal proposal network updates the current goal every 5 steps by computing the goal with the maximal value, see Suppl. A.5 for more details.
We ablate different components of our architecture to demonstrate their impact on the performance. In one variant we remove the surprise detection, and in another we remove the resource allocation (uniform task sampling). Figure 8(a) shows the performance and reveals that the surprise signal is a critical part of the machinery. If it is removed, the performance drops to that of the SAC baseline, i.e. only the locomotion task is solved. Figure 8(b,c) provides insight into why this happens. Without the surprise signal, the goal proposal network does not get enough positive training data to learn from; hence, it constantly samples random goals, which prevents the successful switches that would create additional training data. As expected, the resource allocation speeds up learning and makes learning the hard tasks faster, cf. the full CWYC agent and the variant without resource allocation.
6 Conclusion
We present the Control What You Can (CWYC) method, which makes an autonomous agent learn to control the components of its environment effectively. We adopt a task-planning agent architecture in which all components are learned from scratch. Driven by learning progress, the IM agent learns an automatic curriculum, which allows it neither to invest resources in uncontrollable objects nor to try disproportionately often to improve its performance on tasks that are not fully solvable. This key feature differentiates CWYC from approaches based solely on curiosity.
-  Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
-  Adrien Baranes and Pierre-Yves Oudeyer. Active Learning of Inverse Models with Intrinsically Motivated Goal Exploration in Robots. Robotics and Autonomous Systems, 61(1):69–73, January 2013.
-  Dominik Baumann, Jia-Jie Zhu, Georg Martius, and Sebastian Trimpe. Deep reinforcement learning for event-triggered control. arXiv preprint arXiv:1809.05152, 2018.
-  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
-  Martin V. Butz. Which structures are out there. In Thomas K. Metzinger and Wanja Wiese, editors, Philosophy and Predictive Processing, chapter 8. MIND Group, Frankfurt am Main, 2017.
-  Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288, 2005.
-  Cédric Colas, Pierre Fournier, Olivier Sigaud, and Pierre-Yves Oudeyer. Curious: Intrinsically motivated multi-task, multi-goal reinforcement learning, 2018. arXiv preprint https://arxiv.org/abs/1810.06284.
-  Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pages 1514–1523, 2018.
-  Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
-  Christian Gumbsch, Sebastian Otte, and Martin V Butz. A computational model for the dynamical learning of event taxonomies. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society, pages 452–457, 2017.
-  Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
-  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of PMLR, pages 1861–1870. PMLR, 10–15 Jul 2018.
-  WPMH Heemels, Karl Henrik Johansson, and Paulo Tabuada. An introduction to event-triggered and self-triggered control. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 3270–3285. IEEE, 2012.
-  Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
-  F. Kaplan and P.-Y. Oudeyer. Maximizing learning progress: An internal reward system for … In Embodied Artificial Intelligence, pages 259–270, 2004.
-  Lydia Kavraki, Petr Svestka, and Mark H Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. 1994.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2015. arXiv preprint https://arxiv.org/abs/1412.6980.
-  Alexander S Klyubin, Daniel Polani, and Chrystopher L Nehaniv. Empowerment: A universal agent-centric measure of control. In Evolutionary Computation, 2005. The 2005 IEEE Congress on, volume 1, pages 128–135. IEEE, 2005.
-  Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016.
-  Steven M LaValle. Rapidly-exploring random trees: A new tool for path planning. 1998.
-  D. Y. Little and F. T. Sommer. Learning and exploration in action-perception loops. Frontiers in Neural Circuits, 7(37), 2013.
-  Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018.
-  P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Trans. on Evo. Computation, 11(2):265–286, 2007.
-  Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
-  Alexandre Péré, Sébastien Forestier, Olivier Sigaud, and Pierre-Yves Oudeyer. Unsupervised learning of goal spaces for intrinsically motivated goal exploration. arXiv preprint arXiv:1803.00781, 2018.
-  Michael Posa, Cecilia Cantu, and Russ Tedrake. A direct method for trajectory optimization of rigid bodies through contact. The International Journal of Robotics Research, 33(1):69–81, 2014.
-  Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing-solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.
-  Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
-  Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.
-  J. Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173–187, 2006.
-  Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pages 222–227, 1991.
-  Bruno Siciliano and Oussama Khatib. Springer handbook of robotics. Springer, 2016.
-  Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
-  E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, Oct 2012.
-  S. Yi, F. Gomez, and J. Schmidhuber. Planning to be surprised: Optimal Bayesian exploration in dynamic environments. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain View, CA, 2011.
-  Keyan Zahedi, Georg Martius, and Nihat Ay. Linear combination of one-step predictive information with an external reward in an episodic policy gradient setting: a critical analysis. Frontiers in Psychology, 4(801), 2013.
-  Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.
Appendix A Details of the method
a.1 Final task selector
The task selector models the learning progress made when attempting to solve a task and is implemented as a multi-armed bandit. The reward is given in Eq. 1. We use the absolute value of the learning progress because the system should learn both when it can improve and when performance degrades. Initially, the surprise term dominates the reward. As soon as actual progress can be made, the learning-progress term takes the leading role. The reward is non-stationary, so the action-value of each task is updated incrementally toward the most recent reward with a fixed learning rate. The task selector chooses the (final) task for each rollout according to the relative values of the tasks. Because we want to maintain exploration, we opt for a stochastic policy.
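The bandit described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the default learning rate, and the mixing of a uniform exploration term (`random_eps`, a name borrowed from Appendix F) into value-proportional sampling are assumptions, as is the requirement that rewards are non-negative.

```python
import random


class TaskSelector:
    """Non-stationary multi-armed bandit over tasks (illustrative sketch)."""

    def __init__(self, n_tasks, lr=0.1, random_eps=0.05, seed=0):
        self.values = [0.0] * n_tasks   # one action-value per task
        self.lr = lr                    # assumed value; not given in the text
        self.random_eps = random_eps    # share of uniform exploration
        self.rng = random.Random(seed)

    def update(self, task, reward):
        # Move the value toward the latest reward, so old rewards fade out.
        self.values[task] += self.lr * (reward - self.values[task])

    def probabilities(self):
        n = len(self.values)
        total = sum(self.values)
        if total <= 0.0:
            # No information yet: choose uniformly.
            return [1.0 / n] * n
        eps = self.random_eps
        # Value-proportional choice mixed with uniform exploration.
        return [(1.0 - eps) * v / total + eps / n for v in self.values]

    def sample(self):
        tasks = range(len(self.values))
        return self.rng.choices(tasks, weights=self.probabilities())[0]
```

Because every task keeps a probability of at least `random_eps / n_tasks`, tasks whose value has dropped to zero are still attempted occasionally.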
a.2 Low-level control
a.3 Subgoal sampling
For each subtask, the goal is selected as the point with the maximal value in the attention map. However, the coordinates of tasks that are still to be solved in the task chain are held fixed, because they likely cannot be controlled by the current policy. The maximization is therefore constrained over the remaining part of the task chain, with the coordinates belonging to each task selected as described in Sec. 3.1. The goal for the subtask is then given by the coordinates of the maximizer that belong to that subtask. This is a convex program, and its solution can be computed analytically.
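To illustrate why such a problem has an analytic solution, the sketch below assumes that the attention map has the form exp(-||A s + c||^2), i.e. a negative sum of squared affine terms in the exponent. Under that assumption, maximizing over the free coordinates while holding the others fixed is a linear least-squares problem. The function name, the matrix `A`, the offset `c`, and the index convention are all hypothetical.

```python
import numpy as np


def propose_goal(A, c, state, free_idx):
    """Maximise exp(-||A s + c||^2) over s[free_idx], other coords fixed."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(c, dtype=float)
    state = np.asarray(state, dtype=float)
    fixed = np.ones(state.size, dtype=bool)
    fixed[free_idx] = False
    # Contribution of the fixed coordinates to each affine term.
    offset = A[:, fixed] @ state[fixed] + c
    # Least-squares solution of A_free x = -offset.
    x, *_ = np.linalg.lstsq(A[:, free_idx], -offset, rcond=None)
    goal = state.copy()
    goal[free_idx] = x
    return goal


# Toy relation "agent position equals tool position" in 2-D.
A = [[1, 0, -1, 0],
     [0, 1, 0, -1]]
goal = propose_goal(A, c=[0, 0], state=[0, 0, 3, 4], free_idx=[0, 1])
```

In the toy call, the tool coordinates are held fixed and the proposed goal places the agent at the tool position.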
a.4 Intrinsic motivations
To compute the success rate of a task, we use a running mean over its most recent attempts, where each attempt contributes a binary success indicator for the rollout in which the task was attempted. The learning progress is then given as the finite difference of the success rate between subsequent attempts of the task.
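The running-mean success rate and its finite difference can be sketched as below. The class and attribute names are hypothetical, and the default window of 100 mirrors the "avg. window size" entry in Appendix F; attaching that value to this quantity is an assumption.

```python
from collections import deque


class ProgressTracker:
    """Success rate and learning progress for one task (illustrative)."""

    def __init__(self, window=100):
        self.outcomes = deque(maxlen=window)  # most recent attempts only
        self.success_rate = 0.0
        self.learning_progress = 0.0

    def record(self, success):
        self.outcomes.append(1.0 if success else 0.0)
        new_rate = sum(self.outcomes) / len(self.outcomes)
        # Finite difference between subsequent attempts of the task.
        self.learning_progress = new_rate - self.success_rate
        self.success_rate = new_rate
        return self.learning_progress
```

The sign of the returned value distinguishes improvement from degradation; a task selector that uses the absolute value treats both as worth attention.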
To compute the surprise signal, we compute the statistics of the prediction error over all collected experience, i.e. we assume a distribution for the prediction error and compute its empirical parameters. Surprise within one rollout is then defined in terms of the finite difference of the prediction error relative to these statistics. The definition involves a hyperparameter that needs to be chosen.
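One plausible reading of this definition is a threshold test: a step counts as surprising when the change in prediction error exceeds the empirical mean by a multiple of the empirical standard deviation. The sketch below implements that reading. The exact rule, the Gaussian interpretation, and the names `mean`, `std`, and `theta` are assumptions, not taken from the text.

```python
import numpy as np


def surprise_signal(errors, mean, std, theta):
    """Binary surprise per step of one rollout (assumed threshold rule)."""
    errors = np.asarray(errors, dtype=float)
    # Finite difference of the prediction error along the rollout.
    jumps = np.abs(np.diff(errors))
    return (jumps > mean + theta * std).astype(int)


# Statistics would be estimated from all collected experience.
history = np.array([0.0, 0.1, 0.05, 0.15, 0.1])
flags = surprise_signal([0.1, 0.1, 0.12, 2.0, 2.0],
                        mean=history.mean(), std=history.std(), theta=3.0)
```

A larger `theta` makes the detector more conservative, so only pronounced jumps in prediction error are flagged.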
a.5 Training details of the goal proposal network
In ever-changing environments such as the ones presented in this paper, the goal proposal networks are a critical component of our framework; their aim is to learn relations between entities in the world. The agent labels the transitions observed in the environment as either interesting or undetermined. Interesting transitions are those in which a surprising event (high prediction error) occurs, or which lead to a success in a task given that some other task was solved before, see Eq. 4. All other transitions are labeled as undetermined, since they might contain transitions that are similar to those labeled interesting but did not spark high interest. Coming back to our running example: bumping into the tool, and hence suddenly moving it, might spark interest in the tool because of a sudden jump in prediction error. In general, the behaviour of an object after the surprising event is unknown, and the label for these transitions is not clear. Consequently, we discard all undetermined transitions within a rollout that come after a transition with a positive label.
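The discarding rule at the end of this paragraph can be sketched as a single pass over the labels of one rollout. The function name and the boolean encoding (True for interesting, False for undetermined) are illustrative choices.

```python
def filter_rollout(labels):
    """Return the indices of the transitions to keep from one rollout.

    labels: sequence of bool, True = interesting, False = undetermined.
    """
    keep = []
    seen_positive = False
    for i, positive in enumerate(labels):
        if positive:
            keep.append(i)          # positives are always kept
            seen_positive = True
        elif not seen_positive:
            keep.append(i)          # undetermined, before any positive
        # Undetermined transitions after a positive one are dropped.
    return keep


kept = filter_rollout([False, False, True, False, True, False])
```

In the example, the undetermined transitions at indices 3 and 5 follow a positive label and are therefore removed.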
After removing all data that might prevent the goal proposal networks from learning the right relations, the problem remains that positive events are rare compared to the massive body of undetermined data. Hence, we balance the training data in each batch during training.
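A simple way to balance each batch is to draw the same number of positive and undetermined samples, sampling with replacement because positives are scarce. The even split and the function below are assumptions for illustration; the text does not state the exact ratio.

```python
import random


def balanced_batch(positives, undetermined, batch_size, rng=None):
    """Draw a batch with an equal share of both labels (assumed 50/50)."""
    rng = rng or random.Random()
    half = batch_size // 2
    # Sampling with replacement lets a few positives fill their half.
    pos = rng.choices(positives, k=half)
    neg = rng.choices(undetermined, k=batch_size - half)
    batch = [(x, 1) for x in pos] + [(x, 0) for x in neg]
    rng.shuffle(batch)
    return batch


batch = balanced_batch(["p1", "p2"], list(range(1000)), 64, random.Random(0))
```

Each element of the returned batch is a `(sample, label)` pair, with label 1 for interesting and 0 for undetermined transitions.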
To make efficient use of the few positive samples collected at the beginning of training, we impose a structural prior on the goal proposal network, given by Eq. 3. The weight matrices are depicted in Fig. 9. This particular structure restricts the hypothesis space of this component to positional relations between components of the observation space, which contains the entities in the environment. In the main text, Figure 7 shows a compact representation of the initial and final weight matrices for different tasks, computed by taking the minimum over the matrices shown in the left column and the middle column of Fig. 9.
To understand the parametrization, consider modeling that two components of the observation should have the same value for a positive signal. The weight on one of the two components should then be nonzero, and the weight on the other should be its negative. In this case, the corresponding term in the exponent of Eq. 4 is zero if the two components are equal. In the learned weights shown in Fig. 9, this relationship holds for the relevant components (position of agent, tool and object).
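The effect of this parametrization can be checked numerically. The sketch assumes a pairwise form in which the exponent sums squared terms `w1[k, l] * s[k] + w2[k, l] * s[l] + w3[k, l]` over component pairs; this form, the scale `gamma`, and all names are assumptions for illustration rather than the paper's exact definition.

```python
import numpy as np


def relation_score(s, w1, w2, w3, gamma=1.0):
    """Evaluate an assumed pairwise attention map at observation s."""
    s = np.asarray(s, dtype=float)
    n = s.size
    total = 0.0
    for k in range(n):
        for l in range(k + 1, n):
            term = w1[k, l] * s[k] + w2[k, l] * s[l] + w3[k, l]
            total += term ** 2
    return float(np.exp(-gamma * total))


n = 4
w1, w2, w3 = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
w1[0, 2], w2[0, 2] = 1.0, -1.0      # encode "component 0 equals component 2"

matched = relation_score([3.0, 0.0, 3.0, 7.0], w1, w2, w3)
mismatched = relation_score([0.0, 0.0, 3.0, 7.0], w1, w2, w3)
```

With opposite weights on the two components, the score is maximal when they coincide and decays with their squared difference otherwise.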
a.6 Training / overall procedure
All components of CWYC start in a completely uninformed state. A rollout starts by randomly scrambling the environment. The (final) task is chosen by the task selector, and the task planner constructs the task chain. Every 5 steps in the environment, the goal proposal network computes a goal for the current task. Given the subgoal, the goal-parametric policy of that task is used. Whenever the goal is reached (up to a certain precision), a switch to the next task occurs. Again, the goal proposal network is employed to select a goal in this task, unless it is the final task, in which case the final goal is used. If a goal cannot be reached, the task ends after a maximum number of steps. In practice we run 5 rollouts in parallel. Then all components are trained using the collected data. For the task selector and task planner we use Eq. 5 and Eq. 2, respectively. The forward model and the goal proposal networks are trained using a squared loss and Adam. The policies are trained according to SAC/DDPG+HER. Pseudo-code and implementation details can be found in Appendices B and F.
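The execution of one task chain within a rollout, as described above, can be sketched as follows. This is an illustrative skeleton, not Algorithm 1: every argument (`env_step`, `propose_goal`, `policies`, `goal_reached`) is a hypothetical callable standing in for the corresponding learned component, and training is omitted.

```python
def run_task_chain(env_step, state, task_chain, propose_goal, policies,
                   goal_reached, final_goal, max_steps, replan_every=5):
    """Execute chained subtasks; return (last state, overall success)."""
    for pos, task in enumerate(task_chain):
        is_final = pos == len(task_chain) - 1
        # The final task uses the final goal; earlier tasks use proposals.
        goal = final_goal if is_final else propose_goal(task, state)
        steps = 0
        while not goal_reached(task, state, goal):
            if steps >= max_steps:
                return state, False      # subtask timed out
            if not is_final and steps > 0 and steps % replan_every == 0:
                # Refresh the subgoal every few environment steps.
                goal = propose_goal(task, state)
            action = policies[task](state, goal)
            state = env_step(state, action)
            steps += 1
        # Goal reached: switch to the next task in the chain.
    return state, True
```

In a toy setting, `state` could be a tuple of agent and tool positions, with the agent subtask proposing the tool position as its subgoal before the tool task pursues the final goal.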
Appendix B Pseudocode
The pseudocode for the method is given in Algorithm 1.
Appendix C Environments
c.1 Synthetic environment
The synthetic environment is depicted in Fig. 2a and is simulated by the physics engine MuJoCo. The agent is modeled as a ball that is controlled by applying force along the x and y axes, so the agent's action is a 2-dimensional force vector.
The motion of the agent is subject to the laws of motion, including friction from the environment, which makes it non-trivial to control. Besides the agent, the environment contains objects with different dynamics. The positions of the objects are part of the agent's observation space, along with a flag for each object that specifies whether it has been picked up by the agent. The environment is fully observable.
We define the goal spaces of the tasks as the positions of the individual objects. Some objects are harder to move than others and depend on other objects. This means that the agent has to discover the relations between them in order to successfully master the environment.
The types of objects that are used in the experiments are the following:
Static objects: cannot be moved.
Random objects: move randomly in the environment, but cannot be moved by the agent.
50% light objects: can be moved in 50% of the rollouts.
Tool: can be moved and used to move the heavy object.
Heavy objects: can be moved when using the tool.
The observation vector contains the position of the agent and, for each object, its position and a flag that indicates whether the agent is in possession of that object. The goal spaces are the coordinates of the agent and the coordinates of each object.
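The following sketch shows one possible way to lay out such an observation and to address the goal space of each task by index. The ordering used here (agent position first, then position and possession flag per object) and all names are assumptions; the actual ordering in the environment may differ.

```python
import numpy as np


def build_observation(agent_xy, objects_xy, in_possession):
    """Flatten agent and object data into one vector (assumed layout)."""
    obs = list(agent_xy)
    for (x, y), flag in zip(objects_xy, in_possession):
        obs.extend([x, y, 1.0 if flag else 0.0])
    return np.asarray(obs, dtype=float)


def goal_space_indices(n_objects):
    """Map each task to the observation indices of its 2-D goal space."""
    indices = {"agent": [0, 1]}
    for i in range(n_objects):
        start = 2 + 3 * i               # skip agent (2) and earlier objects
        indices[f"object_{i}"] = [start, start + 1]
    return indices


obs = build_observation((0.5, 1.0), [(2.0, 3.0), (4.0, 5.0)], [False, True])
idx = goal_space_indices(2)
second_object_xy = obs[idx["object_1"]]
```

Keeping an explicit index map makes it straightforward to fix or free the coordinates of individual tasks when sampling subgoals.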
c.2 Robotic environment
The robotic environment is depicted in Fig. 2c. The state space is 40-dimensional. It consists of the agent position and velocity, the gripper state, the absolute and relative positions of the box and the hook, and their velocities and rotations.
The environment is based on the OpenAI gym PickAndPlace-v1 environment.
Final goals for the reach and tool tasks are sampled close to the initial gripper and tool locations, respectively. Final goals for the object task are spawned in close proximity to the initial box position, such that the box needs to be pulled closer to the robot but never pushed away. The box is spawned in close proximity to the upper end of the hook, on the side closer to the robot.
Appendix D Oracle baselines
d.1 CWYC with oracle goals
To assess the maximum performance of CWYC in the described settings, we crafted an upper baseline in which all learned high-level components, except for the final task selector, are fixed and set to their optimal values.
In the distractor setting, every task is solved by first doing the locomotion task. The goal proposal network always returns the state value that reflects the ground-truth relation we try to learn.
In the synthetic tool-use setting, the task graph depicted in Figure 10 is used. The goal proposal network always returns the state value that reflects the ground-truth relation we try to learn.
d.2 HIRO/SAC with oracle reward
To see whether HIRO manages to solve the synthetic environment at all, we constructed an oracle version of HIRO. The oracle receives as input not only the distance from, e.g., the tool to the target position, but additionally the distance from the agent to the tool. This signal is rich enough to allow HIRO to solve the tool manipulation task, as shown in Fig. 8(d) in the main text, although it still takes considerably more time than CWYC. We trained the SAC baseline on the same hybrid reward as well.
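A hybrid signal of this kind could, for example, combine the two distances into a single dense reward. The sketch below uses the negative sum of both distances; this particular combination and the function name are assumptions, since the text only states which distances are provided.

```python
import math


def oracle_reward(agent, tool, target):
    """Dense reward from both distances (assumed: negative sum)."""
    tool_to_target = math.dist(tool, target)
    agent_to_tool = math.dist(agent, tool)
    return -(tool_to_target + agent_to_tool)


r = oracle_reward(agent=(0.0, 0.0), tool=(3.0, 4.0), target=(3.0, 4.0))
```

The second term gives the agent a gradient toward the tool even before it has learned that the tool is needed to move the target object.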
Appendix E Additional analysis of the ablation studies
Appendix F Training Details and Parameters
f.1 Synthetic environment
# parallel rollout workers: 5
arena size: : 1600 : 1.0
lr:
batch size: 64
policy type: gaussian
discount: 0.99
reward scale: 5
target update interval: 1
tau (soft update):
action prior: uniform
reg:
layer size (): 256
# layers (): 2
# train iterations: 200
buffer size:

lr:
batch size: 64
input:
confidence interval: 5
network type: MLP
layer size: 100
: 5
# layers: 9
# train iterations: 100
Final task selector:
: lr: random_eps: 0.05 surprise history weighting: 0.99
: avg. window size: 100 surprise history weighting: 0.99 sampling_eps: 0.05
Goal proposal network:
lr:
batch size: 64
L1 reg.: 0.0
L2 reg.: 0.0
init: 1.0
trainable: True
# train iterations: 100
f.2 Robotic environment
# parallel rollout workers: 5
: 150 : 0.05
Q_lr:
pi_lr:
batch size: 256
polyak:
layer size (): 256
# layers (): 3
# train iterations: 80
buffer size:
action_l2: 1.0
relative goals: false
replay strategy: future
replay_k: 4
random_eps: 0.3
noise_eps: 0.2

lr:
batch size: 64
input:
confidence interval: 3
network type: MLP
layer size: 100
: 3
# layers: 9
# train iterations: 100
Final task selector:
: lr: random_eps: 0.05 surprise history weighting: 0.99
: avg. window size: 100 surprise history weighting: 0.99 sampling_eps: 0.05
Goal proposal network:
lr:
batch size: 64
L1 reg.: 0.0
L2 reg.: 0.0
init: 1.0
trainable: True
# train iterations: 30