Autonomous Reinforcement Learning of Multiple Interrelated Tasks

The autonomous learning of multiple tasks is a fundamental capability for developing versatile artificial agents that can act in complex environments. In real-world scenarios, tasks may be interrelated (or "hierarchical"), so that a robot has to first learn to achieve some of them to set the preconditions for learning other ones. Even though different strategies have been used in robotics to tackle the acquisition of interrelated tasks, in particular within the developmental robotics framework, autonomous learning in this kind of scenario is still an open question. Building on previous research in the framework of intrinsically motivated open-ended learning, in this work we describe how this question can be addressed by working at the level of task selection, in particular treating the multiple-interrelated-tasks scenario as an MDP where the system tries to maximise its competence over all the tasks.


I Introduction

Autonomy, intended as the capability of a system to behave and learn without pre-assigned tasks or externally provided knowledge (i.e. knowledge from human designers), is a paramount challenge for the development of artificial agents that can act in complex and continuously changing real-world scenarios. Acquiring many different skills is the necessary starting point to foster versatility and adaptation; autonomous open-ended learning of skills can thus be considered one of the main topics of research in robotics. While the learning of multiple skills can be addressed through different machine learning techniques by sequentially assigning the robot a series of tasks, autonomy implies that the agent has the capability to select which task to focus on and to shift between tasks, possibly in a smart way.

Intrinsic motivations (IMs) have been used in the fields of machine learning and developmental robotics [1, 2, 3] to provide self-generated reinforcement signals driving exploration and skill learning [4, 5, 6, 7]. Other studies [8, 9, 10] implemented IMs as a motivational signal for the autonomous selection of tasks. Some tasks can be defined in terms of “goals”, that is, desirable environment states the agent might aim to accomplish (here we focus on this type of tasks, so we use the terms “goal” and “task” interchangeably).

The learning progress in achieving a goal can be used as a transient reward so that the system focuses on the tasks where it is learning the most, moving to other ones when the task-related skill has been completely learnt or when more promising activities become available [11, 12]. This strategy allows the learning of multiple separate skills, and possibly a dynamic transfer of knowledge between tasks that require similar policies [13].

In real-world scenarios, goals may require specific initial conditions to be achievable, or they may be interrelated so that, to achieve one task, the robot first needs to learn and accomplish other ones. This last case is of particular interest: although it has been studied under different headings, it is still an open issue from an autonomous open-ended learning perspective.

Hierarchical reinforcement learning [14] has been combined with IMs to allow for the autonomous formation of skill sequences, but these methods usually tackle only discrete state and action domains [15], focus on the discovery of sub-goals on the basis of externally given tasks [16], or assume that sub-goals come as predefined rewards [17], thereby reducing the autonomy of the agent during the learning process. Trajectories with via points [18] and parametrised skills [19] are able to learn multiple motor trajectories, but this is commonly done considering single tasks/skills or assuming pre-defined tasks. Imitation learning has achieved important results in the learning of task hierarchies [20, 21, 22], even in association with IMs [23], but by definition it relies on external knowledge sources (i.e. the “instructor”), thus limiting the autonomy of the systems.

Leaving aside other important issues for life-long open-ended learning, such as the autonomous discovery [24, 9, 25] or generation [26] of tasks/goals, in this paper we describe how learning multiple interrelated tasks can be tackled at the level of task selection, providing an analysis of the problem together with a proposed solution (Sec. II). Moreover, we test our hypothesis by comparing different systems (Sec. III-B), implemented as enhancements of the GRAIL architecture [9], in a simulated robotic scenario involving multiple interrelated tasks (Sec. III-A).

II Description of the problem and proposed solution

From a reinforcement learning (RL) perspective [27], the learning of multiple goals can be seen as the learning of different policies $\pi_g$, each one associated with a different goal $g$. Each policy maximises the return provided by the reward function $R_g$ associated with goal $g$ (see also [28, 29]). For each goal $g$ the system thus aims to learn a policy

$$\pi_g^* = \arg\max_{\pi_g} \; \mathbb{E}_{\pi_g}\Big[\sum_{t} \gamma^{t} R_g(s_t, a_t)\Big] \qquad (1)$$

Since we are considering an open-ended learning scenario where no specific task is assigned to the robot, we assume the system is not maximising extrinsic rewards, but rather a competence function $C$ over the set of goals $G$. Here, $C$ is the sum of the agent’s competence $C_g$ at each goal $g$, as made possible by a given candidate goal-selection policy $\Pi$. In other words, the overall competence is a measure of the agent’s ability to efficiently accomplish different goals by allocating its learning time among them using a given policy $\Pi$, associated with an MDP where the agent learns to maximise the competence for each goal rather than the specific reward of that goal. If we consider a finite time horizon $T$, the robot needs to properly allocate its training time to the goals that guarantee the highest competence gain. To do so, the system may use the current derivative of the competence (w.r.t. time) as an intrinsic motivation signal to select, at each time step $t$, the goal with the highest competence improvement, where time here refers to one training step over a given task (the efficacy of this approach has been shown in different works within the intrinsically motivated open-ended learning framework [11, 30, 31]). The problem of task selection can thus be described as an $N$-armed bandit [27] (possibly a rotting bandit [32], given the non-stationary, transient nature of IMs) where the agent learns a policy $\Pi$ to select the goals that maximise the current competence improvement $\dot{C}$:

$$\Pi^* = \arg\max_{\Pi} \; \mathbb{E}_{\Pi}\Big[\sum_{t=0}^{T} \dot{C}_{g_t}\Big] \qquad (2)$$
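
As a concrete illustration of this bandit formulation, the following minimal sketch shows goal selection driven by competence improvement used as an intrinsic reward. The class and parameter names, and the EMA-plus-softmax scheme (borrowed from the GRAIL description in Sec. III-B), are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Minimal sketch: task selection as an N-armed bandit whose reward is the
# competence improvement obtained by practising the selected goal.
class GoalBandit:
    def __init__(self, n_goals, smoothing=0.01, temperature=0.1, rng=None):
        self.values = np.zeros(n_goals)   # EMA of recent competence improvements
        self.smoothing = smoothing        # EMA smoothing factor
        self.temperature = temperature    # softmax temperature (exploration)
        self.rng = rng or np.random.default_rng()

    def select(self):
        # Softmax over the current intrinsic-motivation values.
        prefs = self.values / self.temperature
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        return self.rng.choice(len(self.values), p=probs)

    def update(self, goal, competence_improvement):
        # The intrinsic reward is transient: it fades once competence stops improving.
        self.values[goal] += self.smoothing * (competence_improvement - self.values[goal])
```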

If we constrain the feasibility of the goals to specific environmental conditions, goal selection becomes a contextual bandit problem [27] where the robot has to learn the value of goals and select them depending on the current state $s$. Equation (2) thus becomes

$$\Pi^* = \arg\max_{\Pi} \; \mathbb{E}_{\Pi}\Big[\sum_{t=0}^{T} \dot{C}_{g_t}(s_t)\Big] \qquad (3)$$

where now the policy for selecting goals for training needs to explicitly take into account the current state of the agent, which may include information such as which other goals have already been accomplished. By making this change to the objective, the system can bias the choice using the expected competence gain for each goal given different conditions. The evaluation of the competence improvement for each goal can be done via a state-based moving average of performance at achieving that goal given the current policy.
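
A hedged sketch of this contextual variant follows: the only change with respect to the previous sketch is that a separate estimate of the expected competence improvement is kept for each (context, goal) pair, so that goals are valued only in the conditions where they can actually be achieved. Names and parameters are again illustrative assumptions.

```python
import numpy as np

# Minimal sketch: contextual-bandit goal selection with one competence-improvement
# estimate per (context, goal) pair.
class ContextualGoalBandit:
    def __init__(self, n_contexts, n_goals, smoothing=0.01, temperature=0.1, rng=None):
        self.values = np.zeros((n_contexts, n_goals))
        self.smoothing = smoothing
        self.temperature = temperature
        self.rng = rng or np.random.default_rng()

    def select(self, context):
        prefs = self.values[context] / self.temperature
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        return self.rng.choice(self.values.shape[1], p=probs)

    def update(self, context, goal, competence_improvement):
        delta = competence_improvement - self.values[context, goal]
        self.values[context, goal] += self.smoothing * delta
```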

If we now further assume a situation where goals are interrelated, so that a goal may be a precondition for other ones, we shift to a different kind of problem where the state of the environment depends on previously selected (and possibly achieved) goals. A sequence of contextual bandits, where the context at time $t$ is determined by the action (here, the goal selection) executed at time $t-1$, can be seen as an MDP over all the goal-specific MDPs for which the robot is learning the policies (the “skills”). This is the typical situation of hierarchical skill learning, which is still scarcely addressed within a fully autonomous open-ended framework.

What we propose is that, given the structure of the problem, goal selection for multiple interrelated tasks can be treated as an MDP and, consequently, can be addressed via RL algorithms that transfer intrinsic-motivation values between interrelated goals. In particular, in the following sections we show how a system implementing goal selection with a standard Q-learning algorithm [27] is able to outperform systems that treat it as a standard bandit or contextual bandit problem.
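
To make this proposal concrete, the sketch below treats goal selection as a tabular MDP solved with Q-learning: the state encodes which goals are currently achieved, the action is the goal selected for training, and the reward is the intrinsic competence improvement. Bootstrapping on the maximum Q-value of the next state is what propagates the value of a hard goal back to the goals that are its preconditions. Hyperparameter values are assumptions, not those used in the experiments.

```python
import numpy as np

# Minimal sketch: goal selection as an MDP solved with tabular Q-learning.
class GoalQLearner:
    def __init__(self, n_states, n_goals, alpha=0.1, gamma=0.9, temperature=0.1, rng=None):
        self.q = np.zeros((n_states, n_goals))
        self.alpha, self.gamma, self.temperature = alpha, gamma, temperature
        self.rng = rng or np.random.default_rng()

    def select(self, state):
        prefs = self.q[state] / self.temperature
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        return self.rng.choice(self.q.shape[1], p=probs)

    def update(self, state, goal, competence_improvement, next_state):
        # The bootstrap term transfers the intrinsic value of goals that become
        # achievable in next_state back to the goal practised in state.
        target = competence_improvement + self.gamma * self.q[next_state].max()
        self.q[state, goal] += self.alpha * (target - self.q[state, goal])
```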

Figure 1: The simulated iCub in the experimental setup. When a sphere is touched (given its preconditions) it “lights up” changing its colour to green.

III Methods: setup and system

III-A The robot and the experimental scenarios

To test our hypothesis we designed an experimental scenario where a simulated iCub robot [33] has to perform multiple interrelated tasks consisting in touching (and thereby activating) different spheres anchored to the world (the spheres “float” in front of the robot). Notwithstanding its simplicity, this setup allows us to test all the issues related to goal selection and skill learning of multiple interrelated tasks. We use the two arms of the iCub robot, each with 4 degrees of freedom (DOFs), while the joints of the wrists are kept fixed and the hands are substituted with two scoops. Collisions are disabled in the simulator, while a sensor in the centre of each scoop determines whether the robot touches one of the spheres. Some spheres may be “conditioned” on other ones, so that to activate one of them the robot has to previously activate other spheres. In this way, if the system wants to learn a skill it has to set the environment in the proper condition, i.e. it first has to select and achieve the other goals that constitute the precondition for the one to be trained.

We designed different variations of the general setup to better show the advantages of our solution. In particular we run three different experiments:

  1. No relations / N-armed bandit: all the spheres (here six) can be activated independently of the state of the environment and of the other spheres.

  2. Environmental dependence / Contextual bandit: the activation of a sphere, by having the robot touch it, depends on an environmental variable. In this setting we assume a state feature (the “contextual feature”) that is set to 1.0 with 50% probability at the beginning of each trial, and to 0.0 otherwise. The six spheres the system is learning to activate here depend on the context: three of them can only be activated when the contextual feature is on, and the other three only when it is off.

  3. Multiple interrelated tasks / MDP: the “achievability” of a task (the activation of a sphere) now depends on the activation status of the other spheres. In this scenario, whether the robot has previously achieved a goal (or set of goals) constitutes the precondition for the achievement of other goals, thus introducing interdependencies between the available tasks (see Sec. IV-C for details).

III-B Compared systems

All the compared systems build on a previous architecture, GRAIL [9], designed for the autonomous discovery and intrinsically motivated learning of goals. Due to space constraints, here we describe only the features of GRAIL (and the modifications proposed in the current work) that are needed to understand the presented results, and we refer readers to the cited work for further details.

GRAIL has a high-level component, the goal selector (GS), which determines at each trial the task the system pursues (see Sec. IV for a description of the experimental scheduling). The selected goal is then used by the expert selector (ES) to choose the module (the expert) with which to learn the low-level control policy (the skill) for that task. Two experts are associated with each goal, one per arm: for each goal, the ES chooses which of the two to use, and thus which of the two arms to control. This gives higher versatility to the robot (see [34]), but in this work we do not focus on this aspect. At the lower level of skill learning (the training of the policy), any implementation could be used. In GRAIL we developed each expert as an actor-critic network modified to work with continuous state and action spaces [35] and trained through a TD-learning algorithm on the basis of the pseudo-reward signal generated for achieving the selected goal (lighting up the target sphere). At every time step the selected expert receives as input the angles of the four actuated joints of the arm and returns as output four desired joint angles, used to move the arm through position control.
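
For illustration only, the sketch below shows a drastically simplified, linear stand-in for such an expert: a Gaussian policy over the four desired joint angles and a linear critic, both updated from the TD error computed on the goal-specific pseudo-reward. It is not the authors' network, and all names and learning rates are assumptions.

```python
import numpy as np

# Minimal sketch of a goal-specific "expert": continuous-action actor-critic
# trained with TD-learning on the pseudo-reward for lighting up the target sphere.
class ActorCriticExpert:
    def __init__(self, n_joints=4, sigma=0.05, lr_actor=0.01, lr_critic=0.05,
                 gamma=0.99, rng=None):
        self.w_critic = np.zeros(n_joints + 1)             # linear value function
        self.w_actor = np.zeros((n_joints, n_joints + 1))  # linear policy mean
        self.sigma, self.gamma = sigma, gamma
        self.lr_a, self.lr_c = lr_actor, lr_critic
        self.rng = rng or np.random.default_rng()

    def _features(self, joint_angles):
        return np.append(joint_angles, 1.0)                # joint angles + bias

    def act(self, joint_angles):
        # Desired joint angles = policy mean + Gaussian exploration noise.
        mu = self.w_actor @ self._features(joint_angles)
        return mu + self.rng.normal(0.0, self.sigma, size=mu.shape)

    def learn(self, joint_angles, action, reward, next_joint_angles, done):
        x, x_next = self._features(joint_angles), self._features(next_joint_angles)
        v = self.w_critic @ x
        v_next = 0.0 if done else self.w_critic @ x_next
        td_error = reward + self.gamma * v_next - v        # TD-learning signal
        self.w_critic += self.lr_c * td_error * x
        mu = self.w_actor @ x
        # Move the policy mean towards actions that yielded a positive TD error.
        self.w_actor += self.lr_a * td_error * np.outer(action - mu, x)
```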

The autonomous selection of the goals is performed by the GS according to a competence-based intrinsic motivation (CB-IM) signal calculated for each goal (see [12] for a comparison of different types of IMs). In particular, the CB-IM signal of a goal is the competence prediction improvement (CPI) of a predictor that receives as input the selected goal and produces as output the predicted probability of achieving that goal within the trial.
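
One simple way such a CPI signal could be computed is sketched below: the predictor keeps, for each (context, goal) pair, an estimate of the probability of achieving the goal, and the intrinsic reward is the change of this prediction after observing the outcome of a trial. The tabular form and the update rule are our own assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of a competence-prediction-improvement (CPI) signal.
class CompetencePredictor:
    def __init__(self, n_contexts, n_goals, lr=0.05):
        self.p = np.zeros((n_contexts, n_goals))  # predicted achievement probability
        self.lr = lr

    def predict(self, context, goal):
        return self.p[context, goal]

    def cpi(self, context, goal, achieved):
        # achieved is 1.0 if the selected goal was reached in this trial, else 0.0.
        before = self.p[context, goal]
        self.p[context, goal] += self.lr * (achieved - before)
        # The signal fades once the prediction is already accurate (competence stable).
        return self.p[context, goal] - before
```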

Figure 2: A schema of the architecture implemented in C-GRAIL and M-GRAIL. Differently from GRAIL, the new architectures use context as input to the goal selector. Note that for all the architectures the expert selector and experts are goal-specific.

As described in Sec. II, the core of the autonomous learning process resides in the goal-selection process happening in the GS. In the original version of GRAIL the GS receives no input and gives as output the selected goal, as in a standard bandit setting, where each arm/goal is evaluated on the basis of an exponential moving average (EMA), with smoothing factor set to 0.01, of the previously obtained intrinsic rewards (the CPI). Here we present two different versions of GRAIL (see Fig. 2 for a general schema of the new architectures) that, by modifying the GS component, are able to cope with the added complexity of the scenarios described in Sec. III-A. The first, Contextual-GRAIL (C-GRAIL), provides as input to the GS the state of the environment, which can be composed of standard state features or also of features describing the status of the different goals (e.g. features describing whether each sphere is activated). The GS then selects tasks as in a contextual bandit, where separate EMAs are associated with the different contexts, with the same softmax selection rule as in GRAIL. The second version, Markovian-GRAIL (M-GRAIL), provides the same input to the GS as C-GRAIL, but treats goal selection as a reinforcement learning MDP: the temporal interdependency between goals is modelled as the dependency between consecutive states of the MDP, and Q-learning is used to assign a value to each goal. In particular, here values represent the long-term benefit of practising a goal, considering the intrinsic rewards that other goals depending on it may provide in the future. Goal selection follows the same softmax selection rule as in the previous systems.

Moreover, C-GRAIL and M-GRAIL use the information on the status of the spheres (the contextual input) (a) to generate condition-specific CB-IMs for each goal, and (b) to avoid a “disruptive” training of the low-level policies. Mechanism (a) is implemented by providing the contextual input also to the predictor that generates the CPI signal. Mechanism (b) is implemented by blocking the learning of the selected expert when the predictor of the selected goal has a low (close-to-zero) output, unless the goal is achieved. This avoids situations where a goal is selected even though its preconditions are not satisfied: a trained expert would bring the robot onto the sphere but there would be no effect (the sphere would not activate) and thus no reward signal, generating a “disruptive” modification of the policy. This “error” would not be due to an incorrect policy but to an improper selection by the GS. While the latter selection has to be “punished”, the actor-critic should not modify its policy. Note that while in the original version of GRAIL mechanism (b) could be implemented only using the knowledge of the designers, C-GRAIL and M-GRAIL are able to autonomously regulate this process.
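
A hedged sketch of mechanism (b), assuming a predictor like the one sketched above, could look as follows; the threshold value is an illustrative assumption.

```python
# Minimal sketch of mechanism (b): skip the expert update when the selected goal
# was predicted to be (almost) unachievable in the current context and was indeed
# not achieved, so that a wrong selection does not corrupt a correct policy.
def should_train_expert(predictor, context, goal, achieved, threshold=0.1):
    if achieved:
        return True  # successful trials always train the skill
    return predictor.predict(context, goal) >= threshold
```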

IV Results

This section presents the results of the three experiments described in Sec. III-A. All data are averages over ten replications of the tested systems. The experimental details of each scenario are presented at the beginning of the related sub-sections.

IV-A First experiment: no relations between tasks

Figure 3: Performance of GRAIL in the first experiment. Average over 10 replications of the experiment; shaded areas show the confidence intervals.

This can be considered the baseline experiment, where we simply show the performance of the original GRAIL in a scenario where the 6 tasks presented to the robot (learning to activate the 6 spheres) have no relations with specific conditions of the environment, nor with each other. We run the experiment for 3000 trials, each one ending when the robot touches one of the spheres or after a timeout of 800 steps. After each trial we reset the environment, i.e. the activated spheres are set to off. At the beginning of each trial, the GS selects, as in a bandit problem, the task the system tries to achieve, and then the related expert is trained.

Fig. 3 shows that GRAIL is able to perfectly learn all the tasks within 1700 trials, properly shifting between them during the simulation thanks to the CB-IMs generated by the improvement of the different skills.

IV-B Second experiment: environmental dependence

Figure 4: Performance of GRAIL and C-GRAIL in the second experiment.
Figure 5: Trials wasted by GRAIL and C-GRAIL in selecting tasks that cannot be performed.

In this second experiment we compare GRAIL and C-GRAIL in a modified setup where the value of a contextual feature is used as a precondition determining whether the agent can activate certain spheres: three of the spheres can be activated only when the feature is set to 1.0, and the other three only when it is set to 0.0. At the beginning of each trial, the feature is set to 1.0 with 50% probability. While GRAIL selects tasks without considering the environmental condition, C-GRAIL receives the status of the contextual feature as input and performs task selection as in a contextual bandit.

Fig. 4 shows the performance of GRAIL and C-GRAIL on the 6 tasks during the experiment, which lasted 4000 trials. C-GRAIL is able to properly learn all the tasks within 2000 trials, while at the end of the simulation GRAIL has reached a high competence (over 80%) on only 2 tasks. This is because GRAIL performs goal evaluation and selection without considering the status of the contextual feature, which is instead decisive for the activation of the spheres. As shown in Fig. 5, C-GRAIL wastes trials on “random” selections only once all the tasks have been properly learnt, meaning that it generates an IM for the different tasks only in the conditions where they can actually be achieved; GRAIL, by contrast, wastes time from the beginning of the experiment by selecting tasks also when they cannot be trained, thus impairing the learning process.

IV-C Third experiment: multiple interrelated tasks

The third experiment increases the complexity of the scenario by introducing interrelations between the tasks. This allows us to test the validity of our main hypothesis, implemented in the M-GRAIL system. In particular, in this experiment we have two “chains” of interdependent spheres (see Fig. 6), where an arrow indicates that a sphere is the precondition for the following one. Moreover, the two spheres at the beginning of the two sequences are mutually exclusive, so that if the robot starts one chain it cannot turn on the spheres of the other. Since here we focus on task interdependencies and sequences, the scheduling is as follows: each simulation is run for 2000 “epochs”, where each epoch lasts 3 trials (6000 trials in total). At the end of each epoch we reset the environment, while during the epoch the spheres remain in the status determined by the activity of the robot.
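
For clarity, the scheduling of this experiment can be summarised by the loop below; the environment dynamics are a trivial stand-in (a single dummy chain instead of the actual two mutually exclusive chains), and goal selection is left as a placeholder.

```python
import numpy as np

N_GOALS, N_EPOCHS, TRIALS_PER_EPOCH = 6, 2000, 3   # 6000 trials in total
rng = np.random.default_rng(0)

def attempt(goal, spheres):
    # Dummy stand-in for running the low-level expert for one trial: the goal is
    # achievable only if its (simplified) precondition sphere is already on.
    precondition_ok = goal == 0 or spheres[goal - 1] == 1
    achieved = precondition_ok and rng.random() < 0.5
    if achieved:
        spheres[goal] = 1
    return achieved

for epoch in range(N_EPOCHS):
    spheres = np.zeros(N_GOALS, dtype=int)          # reset only between epochs
    for trial in range(TRIALS_PER_EPOCH):
        context = int("".join(map(str, spheres)), 2)  # on/off status of the spheres
        goal = rng.integers(N_GOALS)                # goal selection (GS) would go here
        achieved = attempt(goal, spheres)
        # ...update the goal-selection values and the selected expert here...
```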

Figure 6: Structure of the third experimental scenario. Black arrows indicate positive dependencies, and red arrows negative dependencies.
Figure 7: Performance of C-GRAIL and M-GRAIL in the third experiment.
Figure 8: Trials wasted by C-GRAIL and M-GRAIL in selecting tasks that cannot be performed.

From the second experiment (Sec. IV-B) we know that the GRAIL system is not able to properly perform autonomous learning when tasks depend on preconditions, so here we tested only C-GRAIL and M-GRAIL. Both systems receive the on/off status of the six spheres as input to the goal selector, but they implement value assignment in the two different ways explained in Sec. III-B.

The performance of the two systems is presented in Fig. 7. In this scenario C-GRAIL can properly learn 4 tasks, but it is not able to achieve high competence on the last spheres of the two chains. M-GRAIL, instead, reaches high performance on all the skills: 5 out of 6 are completely learnt during the simulations, and the remaining sphere reaches a performance close to 90% (on average over the 10 replications). Although both systems are able to assign value to goals only in the states where their preconditions are satisfied, C-GRAIL suffers from the fact that some tasks are “farther” from the initial condition (all spheres off). Whenever a task is completely learnt (the robot has an optimal policy for achieving the goal), the intrinsic motivation for selecting it gradually disappears. This may lead to a situation where the robot starts selecting tasks almost at random due to the absence of intrinsic rewards, thus wasting trials in selecting goals that cannot be achieved at that moment (Fig. 8). As a result, even though the robot may have an intrinsic motivation reward for practising a goal (e.g. activating the last sphere of a chain), it does not have an intrinsic motivation for first practising the goals that are its preconditions; it is thus not capable of systematically putting the environment in the proper conditions to train the last spheres of the chains. On the contrary, M-GRAIL can rapidly learn all the tasks: even when (similarly to C-GRAIL) it is no longer intrinsically motivated to achieve the “simple” goals per se (i.e. goals with few preconditions), it continues to select those goals thanks to the Q-learning algorithm, which propagates the intrinsic motivation for solving the last tasks of the chains back to the tasks that are their preconditions. Thanks to this strategy in goal selection, M-GRAIL starts wasting trials only when all the goals have reached a high performance (Fig. 8).

V Conclusions

In this work we tackled the crucial problem of the autonomous learning of multiple interrelated tasks in robotic scenarios. As described in Sec. II, our hypothesis is that open-ended learning of skills should be treated as a problem of active task selection, to be solved with different strategies depending on the structure of the experimental scenario. In particular, when the tasks are independent of the environmental conditions and of each other, a classical N-armed bandit strategy is sufficient to properly train the low-level skills related to each task (Sec. IV-A). When the tasks are conditioned on specific states of the environment that are independent of the robot's activity, treating the problem as a contextual bandit allows a proper learning of the skills (Sec. IV-B). But if we want to autonomously learn sequences of interdependent tasks, we have to treat task selection as an MDP, where the choices of the system determine changes in the state of the environment (Sec. IV-C). In this way the system can use algorithms such as Q-learning to transfer the intrinsic-motivation value for learning a task to the related ones, thus allowing hierarchical skill learning.

This is particularly important for open-ended learning scenarios, where task-agnostic motivation signals are used to drive goal selection and skill learning. These heuristics are well suited to smartly shifting from one task to another, since the intrinsic value of a goal persists only while the system has something to learn and disappears otherwise. However, this can become a problem when a task constitutes the precondition for another one, so that the robot needs to perform the first even if the motivation for it has faded out. A system such as the presented M-GRAIL is able to keep the advantages of intrinsically motivated learning and, at the same time, to cope with the crucial problem of learning multiple interrelated tasks. Kulkarni and colleagues [29] presented an interesting model combining hierarchical deep reinforcement learning and IMs in a two-level architecture similar to M-GRAIL, where a high-level meta-learner selects goals and a lower-level controller learns the policies to achieve those goals. However, in their work the higher level maximises extrinsic rewards (achieving goals) while intrinsic motivations are used at the lower level: differently from what we tackle in our study, this solution leads the system to focus on the goals that provide more extrinsic reward instead of learning many different goals.

In all the different versions of GRAIL, the experts that acquire the skills are implemented as actor-critic neural networks. However, any other, more efficient method could be used, e.g. parametrised skills such as Dynamic Movement Primitives (or variations of them) trained through policy-search algorithms [36]. The generalisation of the acquired skills to new/different targets was not the focus of our work; using different kinds of controllers might facilitate this process, as well as leveraging visual input to guide motor behaviour and to perform target recognition [37].

A stronger limitation of M-GRAIL resides in the fact that, although it can select (and learn) hierarchically organised tasks, the system is not able to retain these “chains” after the learning process, i.e. to select and perform them as whole skills. Modifying the structure of the experts could provide a solution to this limitation, as could the implementation of a planner on top of M-GRAIL to compose sequences of the autonomously acquired skills using a higher-level encoding [38].

Acknowledgment

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement no 307010 (GOAL-Robots - Goal-based Open-ended Autonomous Learning Robots).

References

  • [1] P.-Y. Oudeyer, F. Kaplan, and V. Hafner, “Intrinsic motivation systems for autonomous mental development,” IEEE Transactions on Evolutionary Computation, vol. 11, no. 6, 2007.
  • [2] J. Schmidhuber, “Formal theory of creativity, fun, and intrinsic motivation (1990–2010),” IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 230–247, 2010.
  • [3] G. Baldassarre and M. Mirolli, Intrinsically Motivated Learning in Natural and Artificial Systems.   Springer Science & Business Media, 2013.
  • [4] V. G. Santucci, G. Baldassarre, and M. Mirolli, “Cumulative learning through intrinsic reinforcements,” in Evolution, Complexity and Artificial Life.   Springer, 2014, pp. 107–122.
  • [5] M. B. Hafez, C. Weber, and S. Wermter, “Curiosity-driven exploration enhances motor skills of continuous actor-critic learner,” in Proceedings of the 7th Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2017.
  • [6] P. Dhakan, K. Merrick, I. Rañó, and N. Siddique, “Intrinsic rewards for maintenance, approach, avoidance, and achievement goal types,” Frontiers in neurorobotics, vol. 12, 2018.
  • [7] D. Tanneberg, J. Peters, and E. Rueckert, “Intrinsic motivation and mental replay enable efficient online adaptation in stochastic recurrent networks,” Neural Networks, vol. 109, pp. 67–80, 2019.
  • [8] A. Baranes and P.-Y. Oudeyer, “Active learning of inverse models with intrinsically motivated goal exploration in robots,” Robotics and Autonomous Systems, vol. 61, no. 1, pp. 49–73, 2013.
  • [9] V. G. Santucci, G. Baldassarre, and M. Mirolli, “Grail: A goal-discovering robotic architecture for intrinsically-motivated learning,” IEEE Transactions on Cognitive and Developmental Systems, vol. 8, no. 3, pp. 214–231, 2016.
  • [10] S. Forestier, Y. Mollard, and P.-Y. Oudeyer, “Intrinsically motivated goal exploration processes with automatic curriculum learning,” arXiv preprint arXiv:1708.02190, 2017.
  • [11] M. Lopes and P.-Y. Oudeyer, “The strategic student approach for life-long exploration and learning,” in Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on.   IEEE, 2012, pp. 1–8.
  • [12] V. G. Santucci, G. Baldassarre, and M. Mirolli, “Which is the best intrinsic motivation signal for learning multiple skills?” Frontiers in neurorobotics, vol. 7, p. 22, 2013.
  • [13] K. Seepanomwan, V. G. Santucci, and G. Baldassarre, “Intrinsically motivated discovered outcomes boost user’s goals achievement in a humanoid robot,” in 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2017, pp. 178–183.
  • [14] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete event dynamic systems, vol. 13, no. 1-2, pp. 41–77, 2003.
  • [15] C. M. Vigorito and A. G. Barto, “Intrinsically motivated hierarchical skill learning in structured environments,” IEEE Transactions on Autonomous Mental Development, vol. 2, no. 2, pp. 132–143, 2010.
  • [16] B. Bakker and J. Schmidhuber, “Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization,” in Proc. of the 8-th Conf. on Intelligent Autonomous Systems, 2004, pp. 438–445.
  • [17] R. Niel and M. A. Wiering, “Hierarchical reinforcement learning for playing a dynamic dungeon crawler game,” in 2018 IEEE Symposium Series on Computational Intelligence (SSCI).   IEEE, 2018, pp. 1159–1166.
  • [18] R. F. Reinhart, “Autonomous exploration of motor skills by skill babbling,” Autonomous Robots, vol. 41, no. 7, pp. 1521–1537, 2017.
  • [19] B. C. Da Silva, G. Baldassarre, G. Konidaris, and A. Barto, “Learning parameterized motor skills on a humanoid robot,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on.   IEEE, 2014, pp. 5239–5244.
  • [20] D. H. Grollman and O. C. Jenkins, “Incremental learning of subtasks from unsegmented demonstration,” in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on.   IEEE, 2010, pp. 261–266.
  • [21] A. Mohseni-Kabir, C. Li, V. Wu, D. Miller, B. Hylak, S. Chernova, D. Berenson, C. Sidner, and C. Rich, “Simultaneous learning of hierarchy and primitives for complex robot tasks,” Autonomous Robots, pp. 1–16, 2018.
  • [22] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 6292–6299.
  • [23] N. Duminy, S. M. Nguyen, and D. Duhaut, “Learning a set of interrelated tasks by using a succession of motor policies for a socially guided intrinsically motivated learner,” Frontiers in neurorobotics, vol. 12, 2018.
  • [24] M. Rolf and M. Asada, “Autonomous development of goals: From generic rewards to goal and self detection,” in Development and Learning and Epigenetic Robotics (ICDL-Epirob), 2014 Joint IEEE International Conferences on.   IEEE, 2014, pp. 187–194.
  • [25] L. Meeden and D. Blank, “Developing grounded goals through instant replay learning,” in The Seventh Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics, 2017.
  • [26] E. Cartoni and G. Baldassarre, “Autonomous discovery of the goal space to learn a parameterized skill,” arXiv preprint arXiv:1805.07547, 2018.
  • [27] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 1998.
  • [28] C. Florensa, D. Held, X. Geng, and P. Abbeel, “Automatic goal generation for reinforcement learning agents,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 1514–1523.
  • [29] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in neural information processing systems, 2016, pp. 3675–3683.
  • [30] V. G. Santucci, G. Baldassarre, and M. Mirolli, “Intrinsic motivation signals for driving the acquisition of multiple tasks: a simulated robotic study,” in Proceedings of the 12th International Conference on Cognitive Modelling (ICCM), 2013.
  • [31] K. E. Merrick, “Intrinsic motivation and introspection in reinforcement learning,” IEEE Transactions on Autonomous Mental Development, vol. 4, no. 4, pp. 315–329, 2012.
  • [32] N. Levine, K. Crammer, and S. Mannor, “Rotting bandits,” in Advances in Neural Information Processing Systems, 2017, pp. 3074–3083.
  • [33] G. Metta, G. Sandini, D. Vernon, L. Natale, and F. Nori, “The icub humanoid robot: an open platform for research in embodied cognition,” in Proceedings of the 8th workshop on performance metrics for intelligent systems.   ACM, 2008, pp. 50–56.
  • [34] V. G. Santucci, G. Baldassarre, and M. Mirolli, “Autonomous selection of the “what” and the “how” of learning: an intrinsically motivated system tested with a two armed robot,” in Development and Learning and Epigenetic Robotics (ICDL-Epirob), 2014 Joint IEEE International Conferences on.   IEEE, 2014, pp. 434–439.
  • [35] K. Doya, “Reinforcement learning in continuous time and space,” Neural computation, vol. 12, no. 1, pp. 219–245, 2000.
  • [36] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert, “Learning movement primitives,” in Robotics research. the eleventh international symposium.   Springer, 2005, pp. 561–572.
  • [37] V. Sperati and G. Baldassarre, “Bio-inspired model learning visual goals and attention skills through contingencies and intrinsic motivations,” IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 2, pp. 326–344, 2017.
  • [38] G. Konidaris, L. P. Kaelbling, and T. Lozano-Perez, “From skills to symbols: Learning symbolic representations for abstract high-level planning,” Journal of Artificial Intelligence Research, vol. 61, pp. 215–289, 2018.