Automatic Curriculum Learning For Deep RL: A Short Survey

03/10/2020 ∙ by Rémy Portelas, et al. ∙ Inria

Automatic Curriculum Learning (ACL) has become a cornerstone of recent successes in Deep Reinforcement Learning (DRL). These methods shape the learning trajectories of agents by challenging them with tasks adapted to their capacities. In recent years, they have been used to improve sample efficiency and asymptotic performance, to organize exploration, to encourage generalization or to solve sparse reward problems, among others. The ambition of this work is twofold: 1) to present a compact and accessible introduction to the Automatic Curriculum Learning literature and 2) to draw a bigger picture of the current state of the art in ACL to encourage the cross-breeding of existing concepts and the emergence of new ideas.


1 Introduction

Human learning is organized into a curriculum of interdependent learning situations of various complexities. Homer surely learned to formulate words before he could compose the Iliad. This idea was first transposed to machine learning by Selfridge et al., who designed a learning scheme to train a cart-pole controller: first training on long and light poles, then gradually moving towards shorter and heavier poles. In the following years, curriculum learning was applied to organize the presentation of training examples or the growth in model capacity in various supervised learning settings [15, 31, 7]. In parallel, the developmental robotics community proposed learning progress as a way to automatically organize the developmental trajectories of learning agents [30]. Inspired by these earlier works, the DRL community developed a family of mechanisms called Automatic Curriculum Learning, which we propose to define as follows:

Automatic Curriculum Learning (ACL) for DRL is a family of mechanisms that automatically adapt the distribution of training data by adjusting the selection of learning situations to the capabilities of learning agents.

Related fields -

ACL shares many connections with other fields. For example, ACL can be used in the context of Transfer Learning, where agents are trained on one distribution of tasks and tested on another [57]. Continual Learning trains agents to be robust to unforeseen changes in the environment, while ACL assumes the agent stays in control of its learning scenarios [33]. Policy Distillation techniques [14] form a complementary toolbox to target multi-task RL settings, where knowledge can be transferred from one policy to another (e.g. from task-expert policies to a generalist policy).

Scope -

This short survey proposes a typology of ACL mechanisms when combined with DRL algorithms and, as such, does not review population-based algorithms implementing ACL (e.g. IMGEP, POET). ACL refers to mechanisms explicitly optimizing the automatic organization of training data. Hence, they should not be confused with emergent curricula, by-products of distinct mechanisms. For instance, the on-policy training of a DRL algorithm is not considered ACL, because the shift in the distribution of training data emerges as a by-product of policy learning. Given this is a short survey, we do not present the details of every particular mechanism. As the current ACL literature lacks theoretical foundations to ground proposed approaches in a formal framework, this survey focuses on empirical results.

2 Automatic Curriculum Learning for DRL

This section formalizes the definition of ACL for Deep RL and proposes a classification.

Deep Reinforcement Learning (DRL)

is a family of algorithms that leverage deep neural networks for function approximation to tackle reinforcement learning problems. DRL agents learn to perform sequences of actions given states in an environment so as to maximize some notion of cumulative reward [56]. Such problems are usually called tasks and formalized as Markov Decision Processes (MDPs) of the form $M = \{\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \rho_0\}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(s' \mid s, a)$ is a transition function characterizing the probability of switching from the current state $s$ to the next state $s'$ given action $a$, $\mathcal{R}$ is a reward function and $\rho_0$ is a distribution of initial states. To challenge the generalization capacities of agents [10], the community introduced multi-task DRL problems where agents are trained on tasks sampled from a task space: $T \sim \mathcal{T}$. In multi-goal DRL, policies and reward functions are conditioned on goals, which augments the task MDP with a goal space $\mathcal{G}$ [51].
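To make the formalism concrete, here is a minimal Python sketch of a goal-conditioned sparse reward and of task sampling from a continuous task space; the function names, tolerance value and task bounds are illustrative choices, not taken from the surveyed papers.

    import numpy as np

    def sparse_goal_reward(achieved_state, goal, tolerance=0.05):
        # Goal-conditioned sparse reward: 1 when the achieved state lies within
        # `tolerance` of the goal, 0 otherwise.
        dist = np.linalg.norm(np.asarray(achieved_state) - np.asarray(goal))
        return float(dist < tolerance)

    def sample_task(low, high, rng=None):
        # Sample task parameters uniformly from a continuous task space bounded
        # by `low` and `high` (e.g. pole length and mass in a cart-pole family).
        rng = rng or np.random.default_rng()
        return rng.uniform(low, high)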

Automatic Curriculum Learning

mechanisms guide DRL algorithms to achieve objectives that can be formalized as the maximization of some metric $P$ computed over a distribution of tasks $\mathcal{T}$ after $t_{end}$ training episodes:

$$\max \int_{T \sim \mathcal{T}} P_T^{t_{end}} \, dT \qquad (1)$$

where $P_T^{t_{end}}$ quantifies the agent’s behavior on task $T$ after $t_{end}$ episodes (e.g. cumulative reward, exploration score).

ACL Typology -

We propose a classification of ACL mechanisms based on three dimensions:

  1. Why use ACL? We review the different objectives that ACL has been used for (Section 3).

  2. What does ACL control? ACL can target different aspects of the learning problem (e.g. environments, goals, reward functions; Section 4).

  3. What does ACL optimize? ACL mechanisms usually target surrogate objectives (e.g. learning progress, diversity) to alleviate the difficulty of optimizing the main objective directly (Section 5).

3 Why use ACL?

ACL has been used for diverse non-exclusive purposes.

Improving performance on a restricted task set -

Classical RL problems are about solving a given task, or a restricted task set (e.g. tasks that only vary by their initial state). In these simple settings, ACL has been used to improve sample efficiency or asymptotic performance [52, 26].

Solving hard tasks -

Sometimes the target tasks cannot be solved directly (e.g. too hard or sparse rewards). In that case, ACL can be used to pose auxiliary tasks to the agent, gradually guiding its learning trajectory from simple to difficult tasks until the target tasks are solved [36, 19, 48, 27, 50]. Another line of work proposes to use ACL to organize the exploration of the state space so as to solve sparse reward problems [6, 42, 53, 43, 8]. In these works, the performance reward is augmented with an intrinsic reward guiding the agent towards uncertain areas of the state space.

Training generalist agents -

Generalist agents must be able to solve tasks they have not encountered during training (e.g. continuous task spaces or distinct training and testing sets). ACL can shape learning trajectories to improve generalization, e.g. by avoiding unfeasible task subspaces [46]. ACL can also help agents to generalize from simulation settings to the real world (Sim2Real) [41, 37] or to maximize performance and robustness in multi-agent settings via Self-Play [54, 44, 3, 2, 58].

Training multi-goal agents -

In multi-goal RL, agents are trained and tested on tasks that vary by their goals. Because agents can control the goals they target, they learn a behavioral repertoire through one or several goal-conditioned policies. The adoption of ACL in this setting can improve performance on a testing set of pre-defined goals [1, 55, 60, 21, 18, 61, 47, 9, 17, 12].

Organizing open-ended exploration -

In some multi-goal settings, the space of achievable goals is not known in advance. Autonomous agents must discover achievable goals as they explore and learn how to reach them. For this problem, ACL can be used to organize the discovery and acquisition of repertoires of robust and diverse behaviors [16, 32, 45, 28, 11].

4 What does ACL control?

While on-policy DRL algorithms directly use training data generated by the current behavioral policy, off-policy algorithms can use trajectories collected from other sources. This practically decouples data collection from data exploitation. Hence, we organize this section into two categories: one reviewing ACL for data collection, the other ACL for data exploitation.

Figure 1: ACL for data collection. ACL can control each element of task MDPs to shape the learning trajectories of agents. Given metrics of the agent’s behavior such as performance or visited states, ACL methods generate new tasks adapted to the agent’s abilities.

4.1 ACL for Data Collection

During data collection, ACL organizes the sequential presentation of tasks as a function of the agent’s capabilities. To do so, it generates tasks by acting on elements of task MDPs (e.g. initial states $\rho_0$, reward functions $\mathcal{R}$, goals; see Fig. 1). The curriculum can be designed on a discrete set of tasks or on a continuous task space. In single-task problems, ACL can define a set of auxiliary tasks to be used as stepping stones towards the resolution of the main task. The following paragraphs organize the literature according to the nature of the control exerted by ACL:

Initial state -

The distribution of initial states can be controlled to modulate the difficulty of a task. Agents start learning from states close to a given target (i.e. easier tasks), then move towards harder tasks by gradually increasing the distance between the initial states and the target. This approach is especially effective for designing auxiliary tasks in complex control scenarios with sparse rewards [19, 27, 50].
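As an illustration of this strategy, the following minimal sketch grows the start-state distribution away from the target as the agent's recent success rate increases; it is a generic instance of the idea rather than the exact algorithm of [19, 27, 50], and all names and thresholds are illustrative.

    import numpy as np

    class ReverseStartCurriculum:
        # Start episodes close to the target and widen the start distribution
        # as the agent succeeds (illustrative sketch).

        def __init__(self, target, max_radius, grow=1.5, success_threshold=0.7):
            self.target = np.asarray(target, dtype=float)
            self.radius = 0.05 * max_radius      # begin with easy, nearby starts
            self.max_radius = max_radius
            self.grow = grow
            self.success_threshold = success_threshold
            self.recent_successes = []

        def sample_initial_state(self, rng):
            # Uniform direction, distance up to the current curriculum radius.
            direction = rng.normal(size=self.target.shape)
            direction /= np.linalg.norm(direction) + 1e-8
            return self.target + direction * rng.uniform(0.0, self.radius)

        def update(self, success):
            # Widen the start distribution once the agent masters the current one.
            self.recent_successes.append(float(success))
            if len(self.recent_successes) >= 50:
                if np.mean(self.recent_successes) > self.success_threshold:
                    self.radius = min(self.radius * self.grow, self.max_radius)
                self.recent_successes = []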

Reward functions -

ACL can be used for automatic reward shaping: adapting the reward function as a function of the learning trajectory of the agent. In curiosity-based approaches especially, an internal reward function guides agents towards areas associated with high uncertainty to foster exploration [6, 42, 53, 43, 8]. As the agent explores, uncertain areas –and thus the reward function– change, which automatically devises a learning curriculum guiding the exploration of the state space. In [21], an ACL mechanism controls the tolerance in a goal-reaching task. Starting with a low accuracy requirement, it gradually and automatically shifts towards stronger accuracy requirements as the agent progresses. In DIAYN [16] and CARML [28], the authors propose to learn a skill space in unsupervised settings (from the state space and from pixels respectively), from which reward functions promoting both behavioral diversity and skill separation are derived.

Goals -

In multi-goal DRL, ACL techniques can be applied to order the selection of goals from discrete sets [32], continuous goal spaces [55, 18, 45, 47] or even sets of different goal spaces [12]. Although goal spaces are usually pre-defined, recent work proposed to apply ACL to a goal space learned from pixels using a generative model [45].

Environments -

ACL has been successfully applied to organize the selection of environments from a discrete set, e.g. to choose among Minecraft mazes [36] or Sonic the Hedgehog levels [39]. A more general –and arguably more powerful– approach is to leverage parametric Procedural Content Generation (PCG) techniques [49] to generate rich task spaces. In that case, ACL makes it possible to detect relevant niches of progress [41, 46, 37].

Opponents -

Self-play algorithms train agents against present or past versions of themselves [54, 3, 58, 2]. The set of opponents directly maps to a set of tasks, as different opponents result in different transition functions $\mathcal{P}$ and possibly state spaces $\mathcal{S}$. Self-play can thus be seen as a form of ACL, where the sequence of opponents (i.e. tasks) is organized to maximize performance and robustness. In single-agent settings, an adversary policy can be trained to perturb the main agent [44].

4.2 ACL for Data Exploitation

ACL can also be used in the data exploitation stage, by acting on training data previously collected and stored in a replay memory. It enables the agent to “mentally experience the effects of its actions without actually executing them”, a technique known as experience replay [34]. At the data exploitation level, ACL can exert two types of control on the distribution of training data: transition selection and transition modification.

Transition selection -

Inspired by the prioritized sweeping technique that organizes the order of updates in planning methods [38], Schaul et al. [52] introduced prioritized experience replay (PER) for model-free RL to bias the selection of transitions for policy updates, as some transitions might be more informative than others. Different ACL methods propose different metrics to evaluate the importance of each transition [52, 60, 12, 61, 32, 11].
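The sketch below illustrates priority-proportional transition selection in the spirit of PER [52]; it is a simplified illustration (a real implementation typically uses a sum-tree for efficiency), and the hyperparameter values are illustrative.

    import numpy as np

    def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
        # `priorities` are per-transition scores, e.g. absolute TD-errors.
        rng = rng or np.random.default_rng()
        p = np.asarray(priorities, dtype=float) ** alpha
        probs = p / p.sum()
        idx = rng.choice(len(probs), size=batch_size, p=probs)
        # Importance-sampling weights correct the bias introduced by the
        # non-uniform sampling (normalized by the maximum weight).
        weights = (len(probs) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, weights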

Transition modification -

In multi-goal settings, Hindsight Experience Replay (HER) proposes to reinterpret trajectories collected with a given target goal with respect to a different goal [1]. In practice, HER modifies transitions by substituting target goals with one of the outcomes achieved later in the trajectory and recomputing the corresponding reward. By explicitly biasing goal substitution to increase the probability of sampling rewarded transitions, HER shifts the training data distribution from simpler goals (achieved now) towards more complex goals as the agent makes progress. Substitute goal selection can be guided by other ACL mechanisms (e.g. favoring diversity [17, 9]).
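A minimal sketch of hindsight goal substitution with the "future" relabelling strategy follows; the field names and the reward_fn interface are illustrative assumptions rather than the exact data structures of [1].

    import numpy as np

    def her_relabel(episode, reward_fn, k=4, rng=None):
        # `episode` is a list of dicts with keys 'obs', 'action', 'achieved_goal',
        # 'goal'; `reward_fn(achieved_goal, goal)` recomputes the sparse reward.
        rng = rng or np.random.default_rng()
        relabeled = []
        for t, tr in enumerate(episode):
            for _ in range(k):
                # Substitute the original goal with a goal achieved later on.
                future_t = rng.integers(t, len(episode))
                new_goal = episode[future_t]['achieved_goal']
                new_tr = dict(tr, goal=new_goal,
                              reward=reward_fn(tr['achieved_goal'], new_goal))
                relabeled.append(new_tr)
        return relabeled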

5 What does ACL optimize?

Objectives such as the average performance on a set of testing tasks after $t_{end}$ training episodes can be difficult to optimize directly. To alleviate this difficulty, ACL methods use a variety of surrogate objectives.

Reward -

As DRL algorithms learn from reward signals, rewarded transitions are usually considered more informative than others, especially in sparse reward problems. In such problems, ACL methods that act on transition selection may artificially increase the ratio of high versus low rewards in the batches of transitions used for policy updates [40, 29, 11]. In multi-goal RL settings where some goals might be much harder than others, this strategy can be used to balance the proportion of positive rewards across goals [12, 32]. Transition modification methods favor rewards as well, substituting goals to increase the probability of observing rewarded transitions [1, 9, 32, 11]. During data collection, however, adapting the training distribution towards more rewarded experience leads the agent to focus on tasks that are already solved. Because collecting data from already solved tasks hinders learning, data collection ACL methods focus instead on other surrogate objectives.

Intermediate difficulty -

A more natural surrogate objective for data collection is intermediate difficulty. Intuitively, agents should target tasks that are neither too easy (already solved) nor too difficult (unsolvable) to maximize their learning progress. Intermediate difficulty has been used to adapt the distribution of initial states from which to perform a hard task [19, 50, 27]. This objective is also implemented in GoalGAN, where a curriculum generator based on a Generative Adversarial Network is trained to propose goals for which the agent reaches intermediate performance [18]. The Setter-Solver approach [47] further introduced a judge network trained to predict the feasibility of a given goal for the current learner. Instead of labelling tasks with an intermediate level of difficulty as in GoalGAN, the Setter-Solver model generates goals associated with a feasibility value sampled uniformly from [0, 1]. The type of goals varies as the agent progresses, but the agent is always asked to perform goals sampled from a distribution balanced in terms of feasibility. In asymmetric self-play [55], tasks are generated by an RL policy trained to propose either goals or initial states such that the resulting navigation task is of intermediate difficulty w.r.t. the current agent. Intermediate difficulty has also driven successes in Sim2Real applications, where ACL sequences domain randomizations to train policies that are robust enough to generalize from simulators to real-world robots [37, 41]. OpenAI [41] train a robotic hand control policy to solve a Rubik’s cube by automatically adjusting the task distribution so that the agent achieves decent performance while still being challenged.
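As a generic illustration of the intermediate-difficulty criterion (not the GoalGAN or Setter-Solver algorithms themselves), the sketch below simply filters candidate goals whose empirical success rate lies in an intermediate band; thresholds and names are illustrative.

    import numpy as np

    def goals_of_intermediate_difficulty(candidate_goals, success_rate_fn,
                                         low=0.1, high=0.9):
        # Keep goals that are neither already solved (success rate > high)
        # nor currently out of reach (success rate < low).
        rates = np.array([success_rate_fn(g) for g in candidate_goals])
        mask = (rates >= low) & (rates <= high)
        return [g for g, keep in zip(candidate_goals, mask) if keep]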

Learning progress -

The objective of ACL methods can be seen as the maximization of a global learning progress: the difference between the final score $P_T^{t_{end}}$ and the initial score $P_T^{0}$. This global learning progress is difficult to optimize, as the impact of each task selection may not be easily traced to the final score after $t_{end}$ episodes, especially when $t_{end}$ is large. Instead, one can use measures of competence learning progress (LP) localized in space and time, as in earlier developmental robotics works [5, 20]. This follows the intuition that maximizing LP here and now (modulo some exploration required to measure LP) will eventually result in maximizing global, long-term LP. In multi-task or multi-goal settings, the agent first focuses on the tasks/goals where it learns the most and then moves towards more difficult tasks once the earlier ones have been mastered (i.e. when their LP decreases). Intermediate difficulty can be seen as a proxy for expected LP, but might get stuck in areas of the task space where the agent achieves intermediate scores but cannot improve.

LP maximization is usually framed as a multi-armed bandit (MAB) problem where tasks are arms and LP measures are the associated values. Maximizing LP values was shown to be optimal under the assumption of concave learning profiles [35]. Both TSCL [36] and RgC [39] measure LP as the estimated derivative of the performance for each task in a discrete set (Minecraft mazes and Sonic the Hedgehog levels respectively) and apply a MAB algorithm to automatically build a curriculum for their learning agents. In a similar way, CURIOUS [12] uses LP to select goal spaces to sample from in a simulated robotic arm setup. There, LP is also used to bias the sampling of transitions used for policy updates towards high-LP goals. ALP-GMM [46] uses LP to organize the presentation of procedurally-generated Bipedal-Walker environments sampled from a continuous task space through a stochastic parameterization. Based on pairs of task parameters and their associated LP scores previously collected, ALP-GMM fits a Gaussian Mixture Model (GMM) and samples task parameters from the Gaussian selected proportionally to its mean LP. LP can also be used to guide the choice of accuracy requirements in a reaching task [21], or to train a replay policy via RL to sample transitions for policy updates [59].
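The following minimal sketch illustrates LP-based task selection over a discrete task set, in the spirit of the MAB formulations above: LP is estimated as the difference between recent and older performance, and tasks are sampled proportionally to absolute LP with some residual uniform exploration. It is a simplified illustration, not the exact TSCL or ALP-GMM algorithm; hyperparameters are illustrative.

    import numpy as np

    class LPTaskSampler:
        def __init__(self, n_tasks, window=20, eps=0.2, rng=None):
            self.perf = [[] for _ in range(n_tasks)]   # performance history per task
            self.window = window
            self.eps = eps
            self.rng = rng or np.random.default_rng()

        def learning_progress(self, task):
            h = self.perf[task]
            if len(h) < 2 * self.window:
                return 1.0  # optimistic value so under-explored tasks get sampled
            recent = np.mean(h[-self.window:])
            older = np.mean(h[-2 * self.window:-self.window])
            return abs(recent - older)

        def sample_task(self):
            if self.rng.random() < self.eps:           # residual uniform sampling
                return self.rng.integers(len(self.perf))
            lp = np.array([self.learning_progress(t) for t in range(len(self.perf))])
            probs = lp / lp.sum() if lp.sum() > 0 else np.full(len(lp), 1 / len(lp))
            return self.rng.choice(len(lp), p=probs)

        def update(self, task, score):
            self.perf[task].append(score)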

Diversity -

Some ACL methods choose to maximize measures of diversity (also called novelty, or low density). In multi-goal settings for example, ACL might favor goals from low-density areas either as targets [45] or as substitute goals for data exploitation [17]. Similarly, [61] biases the sampling of trajectories towards those falling into low-density areas of the trajectory space. In single-task RL, count-based approaches introduce internal reward functions that decrease with the state visitation count, guiding agents towards rarely visited areas of the state space [6]. Through a variational expectation-maximization framework, CARML [28] proposes to alternately update a latent skill representation from experimental data (as in DIAYN [16]) and to meta-learn a policy that adapts quickly to tasks constructed by deriving a reward function from sampled skills. Other algorithms do not optimize directly for diversity but use heuristics to maintain it. For instance, ALP-GMM maintains exploration by using residual uniform task sampling [46], and [3] samples opponents from past versions of different policies to maintain diversity.
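A minimal sketch of a count-based exploration bonus follows; it uses tabular counts over a discretized state for illustration, whereas methods such as [6] rely on pseudo-counts derived from a learned density model. All names and coefficients are illustrative.

    import numpy as np
    from collections import defaultdict

    class CountBasedBonus:
        # Intrinsic reward decreases with the visitation count of the
        # (discretized) state, pushing the agent towards rarely visited areas.

        def __init__(self, beta=0.1, bin_size=0.25):
            self.counts = defaultdict(int)
            self.beta = beta
            self.bin_size = bin_size

        def intrinsic_reward(self, state):
            key = tuple(np.floor(np.asarray(state) / self.bin_size).astype(int))
            self.counts[key] += 1
            return self.beta / np.sqrt(self.counts[key])

    # Usage: r_total = r_extrinsic + bonus.intrinsic_reward(next_state)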

Surprise -

Some ACL methods train transition models and compute intrinsic rewards based on their prediction errors [42, 8] or based on the disagreement (variance) between several models of an ensemble [53, 43]. The general idea is that models tend to give poor predictions (or to disagree) on rarely visited states, thus inducing a bias towards less visited states. However, a single model might show high prediction errors on stochastic parts of the environment (the noisy-TV problem [42]), a phenomenon that does not appear with model disagreement, as all models of the ensemble eventually learn to predict the (same) mean prediction [43]. Other works bias the sampling of transitions for policy updates depending on their temporal-difference error (TD-error), i.e. the difference between the transition’s value and its next-step bootstrap estimate [52, 26]. Whether the error computation involves value models or transition models, these ACL mechanisms favor states related to maximal surprise, i.e. a maximal difference between the expectation (model prediction) and the truth.
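The two functions below sketch both flavours of surprise-based intrinsic reward: prediction error of a single forward model (in the spirit of [42, 8]) and disagreement across an ensemble of forward models (in the spirit of [53, 43]). The models are passed in as plain callables; this is an illustrative sketch, not the implementation of those papers.

    import numpy as np

    def prediction_error_bonus(forward_model, state, action, next_state):
        # Surprise as forward-model prediction error; `forward_model(state, action)`
        # is any callable predicting the next state.
        pred = forward_model(state, action)
        return float(np.mean((pred - next_state) ** 2))

    def disagreement_bonus(ensemble, state, action):
        # Surprise as the variance of the predictions of an ensemble of forward
        # models; note it does not require observing the true next state.
        preds = np.stack([m(state, action) for m in ensemble])
        return float(np.mean(np.var(preds, axis=0)))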

Energy -

In the data exploitation phase of multi-goal settings, [60] prioritizes transitions from high-energy trajectories (e.g. kinetic energy), while CURIOUS [12] prioritizes transitions where the object relevant to the current goal moved (e.g. cube movements in a cube-pushing task).

Adversarial reward maximization (ARM) -

Self-Play is a form of ACL which optimizes agents’ performance when opposed to current or past versions of themselves, an objective that we call adversarial reward maximization (ARM) [25]. While the agents of AlphaGo Zero [54] and Hide&Seek [2] always oppose copies of themselves, [3] trains several policies in parallel and fills a pool of opponents made of current and past versions of all policies. This maintains a diversity of opponents, which helps to fight catastrophic forgetting and to improve robustness. In the multi-agent game StarCraft II, AlphaStar [58] trains three main policies in parallel (one for each of the available player types) and maintains a league of opponents composed of current and past versions of both the three main policies and additional adversary policies. Opponents are not selected at random but so as to be challenging (as measured by winning rates).
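As an illustration of opponent selection in self-play, the sketch below maintains a pool of past policy snapshots and opposes the learner either to its latest copy or to a uniformly sampled past version; the mixing ratio and class interface are illustrative, not the scheme of any specific paper above.

    import copy
    import numpy as np

    class OpponentPool:
        # Keep snapshots of past policies; sample the opponent for the next match.

        def __init__(self, p_latest=0.5, rng=None):
            self.snapshots = []
            self.p_latest = p_latest
            self.rng = rng or np.random.default_rng()

        def add_snapshot(self, policy):
            self.snapshots.append(copy.deepcopy(policy))

        def sample_opponent(self):
            if not self.snapshots:
                raise ValueError("add at least one snapshot first")
            if self.rng.random() < self.p_latest:
                return self.snapshots[-1]            # current version of the policy
            return self.snapshots[self.rng.integers(len(self.snapshots))]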

6 Discussion

The bigger picture -

In this survey, we unify the wide range of ACL mechanisms used in symbiosis with DRL under a common framework. An ACL mechanism is used with a particular goal in mind (e.g. organizing exploration, solving hard tasks; § 3), controls a particular element of task MDPs (e.g. initial states, goals, environments; § 4) and maximizes a surrogate objective to achieve its goal (e.g. diversity, learning progress; § 5). Table 1 organizes the main works surveyed here along these three dimensions. Both the previous sections and Table 1 present what has been implemented in the past and thus, by contrast, highlight potential new avenues for ACL.

Expanding the set of ACL targets -

Inspired by the maturational mechanisms at play in human infants, Elman [15] proposed to gradually expand the working memory of a recurrent model in a word-to-word natural language processing task. The idea of changing the properties of the agent itself (here its memory) was also studied in developmental robotics [4], policy distillation methods [13, 14] and evolutionary approaches [23], but is absent from the ACL-DRL literature. ACL mechanisms could indeed be used to control the agent’s body, its action space (how it acts in the world), its observation space (how it perceives the world), its learning capacities (e.g. the capacity of its memory or of its controller) or the way it perceives time (by controlling discount factors [22]).

Combining approaches - Many combinations of previously defined ACL mechanisms remain to be investigated. Could we use LP to optimize the selection of opponents in self-play approaches? To drive goal selection in learned goal spaces (e.g. the population-based approach of Finot2019)? Could we train an adversarial domain generator to robustify policies trained for Sim2Real applications?

On the need for systematic ACL studies -

Given the positive impact that ACL mechanisms can have in complex learning scenarios, one can only deplore the lack of comparative studies and standard benchmark environments. Moreover, although empirical results advocate for their use, a theoretical understanding of ACL mechanisms is still missing. While there have been attempts to frame CL theoretically in supervised settings [7, 24], more work is needed to see whether such considerations hold in DRL scenarios.

ACL as a step towards open-ended learning agents -

Alan Turing famously said “Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child’s?”. The idea of starting with a simple machine and enabling it to learn autonomously is a cornerstone of developmental robotics, but is rarely considered in DRL [11, 16, 28]. Because they actively organize learning trajectories as a function of the agent’s properties, ACL mechanisms could prove extremely useful in this quest. We could imagine a learning architecture leveraging ACL mechanisms to control many aspects of the learning odyssey, guiding agents from their simple original state towards fully capable agents able to reach a multiplicity of goals. As we saw in this survey, such ACL mechanisms could control the development of the agent’s body and capabilities (motor actions, sensory apparatus), organize exploratory behavior towards tasks where agents learn the most (maximization of information gain, competence progress) or guide the acquisition of behavioral repertoires.

Algorithm Why use ACL? What does ACL control? What does ACL optimize?
ACL for Data Collection (§ 4.1):
ALP-GMM  [46] Generalization Environments (PCG) LP
ADR (OpenAI)  [41] Generalization Environments (PCG) Intermediate difficulty
ADR (Mila)  [37] Generalization Environments (PCG) Intermediate diff. & Diversity
RC  [19] Hard Task Initial states Intermediate difficulty
Single-demo RC [50] Hard Task Initial states Intermediate difficulty
BaRC  [27] Hard Task Initial states Intermediate difficulty
Asym. SP  [55] Multi-Goal Goals , initial states Intermediate difficulty
GoalGAN  [18] Multi-Goal Goals Intermediate difficulty
Setter-Solver  [47] Multi-Goal Goals Intermediate difficulty
RgC  [39] Generalization Environments (DS) LP
TSCL  [36] Hard Task Environments (DS) LP
Acc-based CL  [21] Multi-Goal Reward function LP
Skew-fit  [45] Open-Ended Explo. Goals (from pixels) Diversity
DIAYN [16] Open-Ended Explo. Reward functions Diversity
CARML  [28] Open-Ended Explo. Reward functions Diversity
RARL  [44] Generalization Opponents ARM
AlphaGO Zero  [54] Generalization Opponents ARM
Hide&Seek  [2] Generalization Opponents ARM
AlphaStar  [58] Generalization Opponents ARM & Diversity
Competitive SP  [3] Generalization Opponents ARM & Diversity
Count-based  [6] Hard Task Reward functions Diversity
RND  [8] Hard Task Reward functions Surprise (model error)
ICM  [42] Hard Task Reward functions Surprise (model error)
Disagreement  [43] Hard Task Reward functions Surprise (model disagreement)
MAX  [53] Hard Task Reward functions Surprise (model disagreement)
CURIOUS  [12] Multi-goal Goals LP
LE2  [32] Open-Ended Explo. Goals Reward & Diversity
ACL for Data Exploitation (§ 4.2):
HER  [1] Multi-goal Transition modification Reward
HER-curriculum  [17] Multi-goal Transition modification Diversity
Language HER  [9] Multi-goal Transition modification Reward
Prioritized ER  [52] Performance boost Transition selection Surprise (TD-error)
Curiosity Prio.  [61] Multi-goal Transition selection Diversity
En. Based ER  [60] Multi-goal Transition selection Energy
CURIOUS  [12] Multi-goal Trans. select. & mod. LP & Energy
LE2  [32] Open-Ended Explo. Trans. select. & mod. Reward
IMAGINE  [11] Open-Ended Explo. Trans. select. & mod. Reward
Table 1: Classification of the surveyed papers. The classification is organized along the three dimensions defined in the text above. In Why use ACL?, we only report the main objective of each work. When ACL controls the selection of environments, we specify whether they are selected from a discrete set (DS) or through parametric Procedural Content Generation (PCG). We abbreviate adversarial reward maximization as ARM and learning progress as LP.

References

  • [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In NeurIPS.
  • [2] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch (2019) Emergent tool use from multi-agent autocurricula. arXiv.
  • [3] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch (2017) Emergent complexity via multi-agent competition. arXiv.
  • [4] A. Baranes and P. Oudeyer (2011) The interaction of maturational constraints and intrinsic motivations in active motor development. In ICDL.
  • [5] A. Baranes and P. Oudeyer (2013) Active learning of inverse models with intrinsically motivated goal exploration in robots. Robot. Auton. Syst.
  • [6] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In NeurIPS.
  • [7] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In ICML.
  • [8] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov (2019) Exploration by random network distillation. In ICLR.
  • [9] G. Cideron, M. Seurin, F. Strub, and O. Pietquin (2019) Self-educated language agent with hindsight experience replay for instruction following. arXiv.
  • [10] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman (2018) Quantifying generalization in reinforcement learning. arXiv.
  • [11] C. Colas, T. Karch, N. Lair, J. Dussoux, M. Clément, P. Ford Dominey, and P. Oudeyer (2020) Language as a cognitive tool to imagine goals in curiosity-driven exploration. arXiv.
  • [12] C. Colas, P. Oudeyer, O. Sigaud, P. Fournier, and M. Chetouani (2019) CURIOUS: intrinsically motivated modular multi-goal reinforcement learning. In ICML.
  • [13] W. Czarnecki, S. Jayakumar, M. Jaderberg, L. Hasenclever, Y. W. Teh, N. Heess, S. Osindero, and R. Pascanu (2018) Mix & match agent curricula for reinforcement learning. In ICML.
  • [14] W. M. Czarnecki, R. Pascanu, S. Osindero, S. M. Jayakumar, G. Swirszcz, and M. Jaderberg (2019) Distilling policy distillation. arXiv.
  • [15] J. L. Elman (1993) Learning and development in neural networks: the importance of starting small. Cognition.
  • [16] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2018) Diversity is all you need: learning skills without a reward function. arXiv.
  • [17] M. Fang, T. Zhou, Y. Du, L. Han, and Z. Zhang (2019) Curriculum-guided hindsight experience replay. In NeurIPS.
  • [18] C. Florensa, D. Held, X. Geng, and P. Abbeel (2018) Automatic goal generation for reinforcement learning agents. In ICML.
  • [19] C. Florensa, D. Held, M. Wulfmeier, and P. Abbeel (2017) Reverse curriculum generation for reinforcement learning. In CoRL.
  • [20] S. Forestier, Y. Mollard, and P. Oudeyer (2017) Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv.
  • [21] P. Fournier, O. Sigaud, M. Chetouani, and P. Oudeyer (2018) Accuracy-based curriculum learning in deep reinforcement learning. arXiv.
  • [22] V. François-Lavet, R. Fonteneau, and D. Ernst (2015) How to discount deep reinforcement learning: towards new dynamic strategies. arXiv.
  • [23] D. Ha (2019) Reinforcement learning for improving agent design. Artificial Life.
  • [24] G. Hacohen and D. Weinshall (2019) On the power of curriculum learning in training deep networks. In ICML.
  • [25] D. Hernandez, K. Denamganaï, Y. Gao, P. York, S. Devlin, S. Samothrakis, and J. A. Walker (2019) A generalized framework for self-play training. In IEEE CoG.
  • [26] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver (2018) Distributed prioritized experience replay. arXiv.
  • [27] B. Ivanovic, J. Harrison, A. Sharma, M. Chen, and M. Pavone (2018) BaRC: backward reachability curriculum for robotic reinforcement learning. In ICRA.
  • [28] A. Jabri, K. Hsu, A. Gupta, B. Eysenbach, S. Levine, and C. Finn (2019) Unsupervised curricula for visual meta-reinforcement learning. In NeurIPS.
  • [29] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu (2016) Reinforcement learning with unsupervised auxiliary tasks. arXiv.
  • [30] F. Kaplan and P. Oudeyer (2007) In search of the neural circuits of intrinsic motivation. Frontiers in Neuroscience.
  • [31] K. A. Krueger and P. Dayan (2009) Flexible shaping: how learning in small steps helps. Cognition.
  • [32] N. Lair, C. Colas, R. Portelas, J. Dussoux, P. F. Dominey, and P. Oudeyer (2019) Language grounding through social interactions and curiosity-driven multi-goal learning. arXiv.
  • [33] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, and N. D. Rodríguez (2019) Continual learning for robotics. arXiv.
  • [34] L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
  • [35] M. Lopes and P. Oudeyer (2012) The strategic student approach for life-long exploration and learning. In ICDL.
  • [36] T. Matiisen, A. Oliver, T. Cohen, and J. Schulman (2017) Teacher-student curriculum learning. IEEE TNNLS.
  • [37] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull (2019) Active domain randomization. In CoRL.
  • [38] A. W. Moore and C. G. Atkeson (1993) Prioritized sweeping: reinforcement learning with less data and less time. Machine Learning.
  • [39] S. Mysore, R. Platt, and K. Saenko (2018) Reward-guided curriculum for robust reinforcement learning. Preprint.
  • [40] K. Narasimhan, T. Kulkarni, and R. Barzilay (2015) Language understanding for text-based games using deep reinforcement learning. arXiv.
  • [41] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang (2019) Solving Rubik’s cube with a robot hand. arXiv.
  • [42] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In CVPR.
  • [43] D. Pathak, D. Gandhi, and A. Gupta (2019) Self-supervised exploration via disagreement. arXiv.
  • [44] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. arXiv.
  • [45] V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine (2019) Skew-fit: state-covering self-supervised reinforcement learning. arXiv.
  • [46] R. Portelas, C. Colas, K. Hofmann, and P. Oudeyer (2019) Teacher algorithms for curriculum learning of deep RL in continuously parameterized environments. In CoRL.
  • [47] S. Racanière, A. Lampinen, A. Santoro, D. Reichert, V. Firoiu, and T. Lillicrap (2019) Automated curricula through setter-solver interactions. arXiv.
  • [48] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Van de Wiele, V. Mnih, N. Heess, and J. T. Springenberg (2018) Learning by playing: solving sparse reward tasks from scratch. arXiv.
  • [49] S. Risi and J. Togelius (2019) Procedural content generation: from automatically generating game levels to increasing generality in machine learning. arXiv.
  • [50] T. Salimans and R. Chen (2018) Learning Montezuma’s Revenge from a single demonstration. In NeurIPS.
  • [51] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In ICML.
  • [52] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. arXiv.
  • [53] P. Shyam, W. Jaśkowski, and F. Gomez (2018) Model-based active exploration. arXiv.
  • [54] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis (2017) Mastering the game of Go without human knowledge. Nature.
  • [55] S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus (2017) Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv.
  • [56] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
  • [57] M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. JMLR.
  • [58] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature.
  • [59] D. Zha, K. Lai, K. Zhou, and X. Hu (2019) Experience replay optimization. arXiv.
  • [60] R. Zhao and V. Tresp (2018) Energy-based hindsight experience prioritization. arXiv.
  • [61] R. Zhao and V. Tresp (2019) Curiosity-driven experience prioritization via density estimation. arXiv.