Trying AGAIN instead of Trying Longer: Prior Learning for Automatic Curriculum Learning

04/07/2020 · Rémy Portelas et al. · Inria

A major challenge in the Deep RL (DRL) community is to train agents able to generalize over unseen situations, which is often approached by training them on a diversity of tasks (or environments). A powerful method to foster diversity is to procedurally generate tasks by sampling their parameters from a multi-dimensional distribution, enabling in particular to propose a different task for each training episode. In practice, to get the high diversity of training tasks necessary for generalization, one has to use complex procedural generation systems. With such generators, it is hard to get prior knowledge on the subset of tasks that are actually learnable at all (many generated tasks may be unlearnable), on their relative difficulty, and on the most efficient task distribution ordering for training. A typical solution in such cases is to rely on some form of Automated Curriculum Learning (ACL) to adapt the sampling distribution. One limit of current approaches is their need to explore the task space to detect progress niches over time, which leads to a loss of time. Additionally, we hypothesize that the induced noise in the training data may impair the performance of brittle DRL learners. We address this problem by proposing a two-stage ACL approach where 1) a teacher algorithm first learns to train a DRL agent with a high-exploration curriculum, and then 2) distills learned priors from the first run to generate an "expert curriculum" to re-train the same agent from scratch. Besides demonstrating 50% improvements on average over the current state of the art, the objective of this work is to give a first example of a new research direction oriented towards refining ACL techniques over multiple learners, which we call Classroom Teaching.


1 Introduction

Automatic CL.

The idea of organizing the learning sequence of a machine is an old concept that stems from multiple works in reinforcement learning (Selfridge et al., 1985; Schmidhuber, 1991), developmental robotics (Oudeyer et al., 2007) and supervised learning (Elman, 1993; Bengio et al., 2009), from which the Deep RL community borrowed the term Curriculum Learning (CL). Automatic CL refers to approaches able to autonomously adapt their task sampling distribution to their evolving learner with minimal expert knowledge. Several ACL approaches have recently been proposed (Florensa et al., 2018; Racanière et al., 2019; OpenAI et al., 2019; Portelas et al., 2019; Colas et al., 2019; Pong et al., 2019; Jabri et al., 2019; Laversanne-Finot et al., 2018).

ACL and exploration.

One of the limits of ACL approaches is that, when applied to a large parameterized task space with few learnable subspaces, as when considering a rich procedural generation system, they lose a lot of time finding the "optimal parameters" at a given point in time (e.g. the niches of progress in Learning Progress-based approaches) through task exploration. We also hypothesize that these additional tasks presented to the DRL learner have a cluttering effect on the gathered training data, which adds noise to its already brittle gradient-based optimization and leads to sub-optimal performance.

Proposed approach.

Given this hypothesized drawback of task exploration, we propose to study whether ACL techniques could be improved by a two-stage approach consisting of 1) a preliminary run with ACL from which prior knowledge on the task space is extracted, and 2) a second, independent run leveraging this prior knowledge to propose a better curriculum to the DRL agent. For its simplicity and versatility, we choose to develop such an approach on top of ALP-GMM (Portelas et al., 2019), a recent ACL algorithm for continuous task spaces.

Related work.

Within DRL, Policy Distillation (Czarnecki et al., 2019) consists in leveraging a previously trained policy, the "teacher", and using it to perform behavior cloning by training a "student" policy to jointly maximize its reward on one or several tasks while minimizing the distance between its action distribution and the teacher's. This speeds up the learning of bigger architectures and/or leverages task-experts to train a single learner on a set of tasks. From this point of view, our work can be seen as a complementary approach interested in how to perform Curriculum Distillation when considering a continuous space of tasks.

Similar ideas were developed for supervised learning by Hacohen and Weinshall (2019). In their work, the authors propose an approach to infer a curriculum from past training for an image classification task: they train a first network without curriculum and use its predictive confidence on each image as a difficulty measure that a subsequently trained network uses for curriculum generation. The idea that a knowledge distillation procedure can be beneficial even when the teacher and student policies have identical architectures has also been studied in supervised learning (Furlanello et al., 2018; Yim et al., 2017). In this work, we propose to extend these concepts to DRL scenarios.

2 Methods

Figure 1: Left: Schematic pipeline of Alp-Gmm And Inferred progress Niches (AGAIN), our proposed approach, which first leverages a preliminary run with a high-exploration ALP-GMM curriculum generator to infer an expert curriculum, combined with a low-exploration ALP-GMM in a second run in which the initial Deep RL learner is re-trained from scratch. Right: Given identical training time, our combined approach outperforms regular ALP-GMM and even matches the Oracle curriculum in a complex parametric BipedalWalker environment.

Alp-Gmm

ALP-GMM (Portelas et al., 2019) is a Learning Progress (LP) based ACL technique for continuous task spaces that does not assume prior knowledge on the task space. It is inspired by previous works in developmental robotics (Baranes and Oudeyer, 2009; Moulin-Frier et al., 2014). ALP-GMM frames the task sampling problem as an EXP-4 non-stationary Multi-Armed Bandit setup (Auer et al., 2002) in which arms are Gaussians spanning the space of task parameters, and whose utility is defined with a local LP measure. The essence of ALP-GMM is to periodically fit a Gaussian Mixture Model (GMM) on recently sampled task parameters concatenated with their respective LP. Then, the Gaussian from which to sample a new task is chosen proportionally to its mean LP dimension. Task exploration happens initially through a bootstrapping period of random task sampling, and during training through occasional random task sampling with a fixed probability p_rnd.
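To make this sampling loop concrete, here is a minimal Python sketch of the two core steps: fitting candidate GMMs on (task parameter, LP) points with AIC-based model selection, and sampling a new task from a Gaussian chosen proportionally to its mean LP, with occasional uniform random sampling. This is not the authors' released implementation; the use of scikit-learn and the names fit_alp_gmm, sample_task, p_rnd and max_k are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_alp_gmm(params_alp, max_k, seed=0):
    """Fit GMMs with 2..max_k components on (task parameter, LP) points and keep
    the one with the lowest AIC (sketch of the periodic ALP-GMM fitting step)."""
    candidates = [GaussianMixture(n_components=k, random_state=seed).fit(params_alp)
                  for k in range(2, max_k + 1)]
    return min(candidates, key=lambda g: g.aic(params_alp))

def sample_task(gmm, param_bounds, p_rnd, rng=None):
    """Sample one task parameter: uniform random with probability p_rnd, otherwise
    from a Gaussian chosen proportionally to its mean LP (last dimension)."""
    rng = rng or np.random.default_rng()
    low, high = param_bounds
    if rng.random() < p_rnd:
        return rng.uniform(low, high)
    lp = np.clip(gmm.means_[:, -1], 0.0, None)             # mean LP of each Gaussian
    probs = lp / lp.sum() if lp.sum() > 0 else np.full(len(lp), 1.0 / len(lp))
    idx = rng.choice(len(lp), p=probs)
    sample = rng.multivariate_normal(gmm.means_[idx], gmm.covariances_[idx])
    return np.clip(sample[:-1], low, high)                  # drop the LP dimension
```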

Inferred progress Niches (IN)

Using ALP-GMM is convenient for our target experiments, as deriving an expert curriculum from an initial run is straightforward: one simply needs to gather the sequence of GMMs that were periodically fitted along training:

$\mathcal{G} = \big[\, G_i = \{(\mu_{i,j}, \Sigma_{i,j}, \mathrm{LP}_{i,j})\}_{j=1}^{K_i} \,\big]_{i=1}^{T}$   (1)

with $T$ the total number of GMMs in the list, $K_i$ their respective number of components and $\mathrm{LP}_{i,j}$ the Learning Progress of each Gaussian. By keeping only Gaussians whose LP is above a predefined threshold $\delta_{LP}$, we get a curated list $\mathcal{G}'$. We name the resulting ACL approach Inferred progress Niches (IN) and propose variants to select which GMM from $\mathcal{G}'$ is used to sample tasks over time during the second run:

  • Pool-based (IN-P) A rather crude approach is to disregard the ordering of $\mathcal{G}'$ and consider the entire trajectory of GMMs as one single pool of Gaussians, i.e. one big mixture having $\sum_i K_i$ components.

  • Time-based (IN-T) In this version, $\mathcal{G}'$ is stepped through periodically at the same rate as in the preliminary ALP-GMM run (i.e. a new GMM every N episodes in our experiments, with N the GMM fitting rate).

  • Reward-based (IN-R) Another option is to iterate over $\mathcal{G}'$ only once the mean episodic reward over tasks recently sampled from the current GMM matches or surpasses the mean episodic reward recorded during the initial run (on the same GMM).

Regardless of the selection process, given a GMM, a new task is selected by sampling a task parameter from a Gaussian chosen proportionally to its LP value.
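The three selection strategies can be sketched in Python as follows (illustrative names, not the released code). Each curated GMM is represented as a list of (mean, covariance, LP) tuples, and the fit_rate, window and reward_thresholds arguments stand in for the hyperparameters discussed in Appendix B.

```python
import numpy as np

class InferredNiches:
    """Sketch of the three IN variants over a curated GMM list (illustrative names).
    Each curated GMM is a list of (mean, covariance, lp) tuples kept from run 1."""

    def __init__(self, curated_gmms, variant, fit_rate, window, reward_thresholds=None):
        self.gmms, self.variant = curated_gmms, variant
        self.fit_rate, self.window = fit_rate, window
        self.thresholds = reward_thresholds           # only used by the "reward" variant
        self.idx, self.episode, self.recent = 0, 0, []

    def current_gaussians(self):
        if self.variant == "pool":                    # IN-P: one big mixture
            return [g for gmm in self.gmms for g in gmm]
        return self.gmms[self.idx]

    def update(self, episodic_reward):
        self.episode += 1
        if self.variant == "time":                    # IN-T: next GMM every fit_rate episodes
            self.idx = min(self.episode // self.fit_rate, len(self.gmms) - 1)
        elif self.variant == "reward":                # IN-R: advance once run-1 reward is matched
            self.recent.append(episodic_reward)
            recent = self.recent[-self.window:]
            if (len(recent) == self.window
                    and np.mean(recent) >= self.thresholds[self.idx]
                    and self.idx < len(self.gmms) - 1):
                self.idx += 1
                self.recent = []

    def sample_task(self, rng=None):
        rng = rng or np.random.default_rng()
        gaussians = self.current_gaussians()
        lp = np.array([max(g[2], 0.0) for g in gaussians])
        probs = lp / lp.sum() if lp.sum() > 0 else np.full(len(lp), 1.0 / len(lp))
        mean, cov, _ = gaussians[rng.choice(len(gaussians), p=probs)]
        return rng.multivariate_normal(mean, cov)
```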

Mixing both in AGAIN.

Simply using one of the proposed IN algorithms directly for the second run lacks adaptive mechanisms with respect to the characteristics of the second agent, whose initial parameters and training data stream differ from the first, which could lead to failure cases where the expert curriculum and the second learner are no longer "in phase". Additionally, if the initial run failed to discover progress niches, IN is bound to fail. We therefore propose to combine IN with an ALP-GMM teacher (with low task-exploration) in the second run. The resulting Alp-Gmm And Inferred progress Niches approach, AGAIN for short, samples tasks from a GMM that is composed of the current mixtures of both ALP-GMM and IN. See figure 1 for a schematic pipeline and appendix B for details.
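As a sketch (again with illustrative names and parameters), combining the two teachers amounts to pooling their current Gaussians and letting LP-proportional sampling arbitrate between them, with a small residual probability p_low of uniform random sampling:

```python
import numpy as np

def again_sample(alp_gmm_gaussians, in_gaussians, param_bounds, p_low, rng=None):
    """One AGAIN sampling step (sketch): with small probability p_low propose a random
    task, otherwise sample LP-proportionally from the union of the current ALP-GMM
    and IN Gaussians (each Gaussian is an illustrative (mean, cov, lp) tuple)."""
    rng = rng or np.random.default_rng()
    low, high = param_bounds
    if rng.random() < p_low:
        return rng.uniform(low, high)
    gaussians = list(alp_gmm_gaussians) + list(in_gaussians)   # composite mixture
    lp = np.array([max(g[2], 0.0) for g in gaussians])
    probs = lp / lp.sum() if lp.sum() > 0 else np.full(len(lp), 1.0 / len(lp))
    mean, cov, _ = gaussians[rng.choice(len(gaussians), p=probs)]
    return np.clip(rng.multivariate_normal(mean, cov), low, high)
```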

3 Experiments and Results

Evaluation procedure.

We test our considered variants and baselines on a parametric version of BipedalWalker proposed by Portelas et al. (2019), which generates walking tracks paved with stumps whose height and spacing are defined by a 2-D parameter vector used for the procedural generation of tasks. This continuous task space has boundaries set in such a way that a substantial part of the space consists of unfeasible tracks for the default walker. As in their work, we also test our approaches with a modified short-legged walker, which constitutes an even more challenging scenario (as the task space is unchanged). All ACL variants are tested when paired with a Soft Actor-Critic (Haarnoja et al., 2018) policy. Performance is measured by tracking the percentage of mastered tasks from a fixed test set. See appendix C for details.
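The test metric can be computed as in the following short sketch, where evaluate_policy, test_set_params and mastery_threshold are placeholder names; the actual test set and mastery threshold come from the experimental setup of Portelas et al. (2019).

```python
import numpy as np

def percent_mastered(evaluate_policy, test_set_params, mastery_threshold):
    """Return the percentage of test tasks on which the policy's episodic reward
    exceeds `mastery_threshold` (threshold value is experiment-specific)."""
    rewards = np.array([evaluate_policy(p) for p in test_set_params])
    return 100.0 * float(np.mean(rewards > mastery_threshold))
```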

Is re-training from scratch beneficial?

The end performances of all tested conditions are summarized in Table 1. Interestingly, for all tested variants, retraining the DRL agent from scratch in the second run gave superior end performance compared to fine-tuning from the weights of the first run. This showcases the brittleness of gradient-based training and the difficulty of transfer learning. Despite this, even fine-tuned variants reached superior end performance compared to classical ALP-GMM, meaning that the change in curriculum strategy is in itself already beneficial.

Is it useful to re-use ALP-GMM in the second run?

In the default walker experiments, the AGAIN-R, -T and -P conditions, which mix ALP-GMM and IN in the second run, reached lower mean performances than their respective IN variants. However, the exact opposite is observed for the IN-R and IN-T variants in the short walker experiments. This can be explained by the difficulty of short walker experiments for ACL approaches, which leads the preliminary 10M-step ALP-GMM runs to reach a much lower mean end-performance than in the default walker experiments. These run failures lead to many of the GMM lists used in IN being of very low quality, which illustrates the advantage of AGAIN, which is able to emancipate itself from IN by relying on ALP-GMM.

Highest-performing variants.

Consistent with the preceding analysis, mixing ALP-GMM with IN in the second run is not essential in the default walker experiments, as the best performing ACL approach is IN-P. This most likely suggests that the improved adaptability of the curriculum when using AGAIN is outweighed by the added noise (due to the low task-exploration). However, in the more complex short walker experiments, mixing ALP-GMM with IN is essential, especially for AGAIN-R, which substantially outperforms ALP-GMM and the other AGAIN and IN variants. The difference in end-performance between AGAIN-R and Oracle, our hand-made expert using privileged information, is not statistically significant.

Condition Short walker Default walker
AGAIN-R
AGAIN-R(fine-tune)
IN-R
IN-R(fine-tune)
AGAIN-T
AGAIN-T(fine-tune)
IN-T
IN-T(fine-tune)
AGAIN-P
AGAIN-P(fine-tune)
IN-P
IN-P(fine-tune)
ALP-GMM
Oracle
Random

Table 1: Experiments on Stump Tracks with short and default bipedal walkers. The average performance with standard deviation after 10 million steps (IN and AGAIN variants) or 20 million steps (others) is reported (30 seeds per condition). For IN and AGAIN we also test variants that do not retrain the weights of the policy used in the second run from scratch but rather fine-tune them from the preliminary run. Markers indicate whether the performance difference with ALP-GMM is statistically significant according to a post-training Welch's t-test (one marker for a performance advantage w.r.t ALP-GMM and another for a performance disadvantage).

4 Conclusion and Discussion

In this work we presented Alp-Gmm And Inferred progress Niches, a simple yet effective approach to learn prior knowledge over a space of tasks to design a curriculum tailored to a DRL agent. Instead of following the same exploratory ACL approach over the entire training, AGAIN performs a first preliminary run with ALP-GMM, derives a list of progress niches from it, and uses this list to build an expert curriculum that is combined with a low task-exploration ALP-GMM teacher for a second run of the same DRL agent, trained from scratch.

Beyond tabula rasa?

In this work we showed that a non-tabula rasa curriculum generator that leveraged prior knowledge over the task space (from a preliminary run) outperformed the regular approach that learned to generate an entire curriculum from scratch. However, we also demonstrated that, from the point of view of the DRL learner, it is actually better to restart tabula rasa (with a non-tabula rasa curriculum generator), which is a very interesting perspective and opens several lines for future work.

Classroom Teaching

Beyond proposing a two-stage ACL technique for a single DRL agent, the experimental setup of this work can be seen as a particular case of a broader problem we propose to name Classroom Teaching (CT). CT defines a family of problems in which a meta-ACL algorithm is tasked to either sequentially or simultaneously generate multiple curricula tailored to each of its learning students, all having potentially varying abilities. CT differs from the problems studied in population-based developmental robotics (Forestier et al., 2017) and evolutionary algorithms (Wang et al., 2019), as in CT the number and characteristics of learners are predefined, and the objective is to foster maximal learning progress over all learners rather than to iteratively construct high-performing policies. Studying CT scenarios brings DRL closer to human education research problems and might stimulate the design of methods that alleviate the expensive use of expert knowledge in current state-of-the-art assisted education (Clément et al., 2015; Koedinger et al., 2013).

References

  • P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §2.
  • A. Baranes and P. Oudeyer (2009) R-IAC: robust intrinsically motivated exploration and active learning. IEEE Trans. Autonomous Mental Development 1 (3), pp. 155–169. External Links: Document Cited by: §2.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pp. 41–48. External Links: Document Cited by: §1.
  • H. Bozdogan (1987) Model selection and akaike’s information criterion (aic): the general theory and its analytical extensions. Psychometrika 52 (3), pp. 345–370. External Links: ISSN 1860-0980, Document Cited by: Appendix A.
  • B. Clément, D. Roy, P. Oudeyer, and M. Lopes (2015) Multi-Armed Bandits for Intelligent Tutoring Systems. Journal of Educational Data Mining (JEDM) 7 (2), pp. 20–48. Cited by: §4.
  • C. Colas, P. Oudeyer, O. Sigaud, P. Fournier, and M. Chetouani (2019) CURIOUS: intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 1331–1340. Cited by: §1.
  • W. M. Czarnecki, R. Pascanu, S. Osindero, S. M. Jayakumar, G. Swirszcz, and M. Jaderberg (2019) Distilling policy distillation. CoRR abs/1902.02186. External Links: 1902.02186 Cited by: §1.
  • J. L. Elman (1993) Learning and development in neural networks: the importance of starting small. Cognition 48 (1), pp. 71–99. External Links: ISSN 0010-0277, Document Cited by: §1.
  • C. Florensa, D. Held, X. Geng, and P. Abbeel (2018) Automatic goal generation for reinforcement learning agents. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1514–1523. Cited by: §1.
  • S. Forestier, Y. Mollard, and P. Oudeyer (2017) Intrinsically motivated goal exploration processes with automatic curriculum learning. CoRR abs/1708.02190. External Links: 1708.02190 Cited by: §4.
  • T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1602–1611. Cited by: §1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290. External Links: 1801.01290 Cited by: §3.
  • G. Hacohen and D. Weinshall (2019) On the power of curriculum learning in training deep networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 2535–2544. Cited by: §1.
  • A. Jabri, K. Hsu, A. Gupta, B. Eysenbach, S. Levine, and C. Finn (2019) Unsupervised curricula for visual meta-reinforcement learning. In Advances in Neural Information Processing Systems 32, pp. 10519–10530. Cited by: §1.
  • K. R. Koedinger, E. Brunskill, R. S. J. de Baker, E. A. McLaughlin, and J. C. Stamper (2013) New potentials for data-driven intelligent tutoring system development and optimization.. AI Magazine 34 (3), pp. 27–41. Cited by: §4.
  • A. Laversanne-Finot, A. Pere, and P. Oudeyer (2018) Curiosity driven exploration of learned disentangled goal spaces. In Proceedings of The 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, Vol. 87, , pp. 487–504. Cited by: §1.
  • C. Moulin-Frier, S. M. Nguyen, and P. Oudeyer (2014) Self-organization of early vocal development in infants and machines: the role of intrinsic motivation. Frontiers in Psychology (Cognitive Science) 4 (1006). External Links: Document, ISSN 1664-1078 Cited by: §2.
  • OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang (2019) Solving rubik’s cube with a robot hand. ArXiv abs/1910.07113. Cited by: §1.
  • P. Oudeyer, F. Kaplan, and V. V. Hafner (2007) Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation 11 (2), pp. 265–286. Cited by: §1.
  • V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine (2019) Skew-fit: state-covering self-supervised reinforcement learning. CoRR abs/1903.03698. External Links: 1903.03698 Cited by: §1.
  • R. Portelas, C. Colas, K. Hofmann, and P. Oudeyer (2019) Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments. External Links: 1910.07224 Cited by: Figure 2, Appendix A, Appendix C, Appendix C, §1, §1, §2, §3.
  • S. Racanière, A. Lampinen, A. Santoro, D. Reichert, V. Firoiu, and T. Lillicrap (2019) Automated curricula through setter-solver interactions. arXiv preprint arXiv:1909.12892. Cited by: §1.
  • J. Schmidhuber (1991) Curious model-building control systems. In In Proc. International Joint Conference on Neural Networks, Singapore, pp. 1458–1463. Cited by: §1.
  • O. G. Selfridge, R. S. Sutton, and A. G. Barto (1985) Training and tracking in robotics. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, Los Angeles, CA, USA, August 1985, pp. 670–672. Cited by: §1.
  • R. Wang, J. Lehman, J. Clune, and K. O. Stanley (2019) Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. CoRR abs/1901.01753. External Links: 1901.01753 Cited by: Appendix C, §4.
  • J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7130–7138. Cited by: §1.

Appendix A Alp-Gmm

ALP-GMM relies on an empirical per-task computation of Absolute Learning Progress (ALP), which allows a GMM to be fitted on a concatenated space composed of task parameters and their respective ALP. Given a newly sampled task with parameter p_new on which the policy collected the episodic reward r_new, its ALP is computed using the closest previously sampled task parameter (Euclidean distance) with associated episodic reward r_old:

$\mathrm{ALP}_{new} = |\, r_{new} - r_{old} \,|$   (2)

All previously encountered task parameters and their associated episodic rewards, recorded in a history database H, are used for this computation. In contrast, the fitting of the GMM is performed every N episodes on a window W containing only the most recent parameter-ALP pairs. The resulting mean ALP dimension of each Gaussian of the GMM is used for proportional sampling. To adapt the number of components of the GMM online, a batch of GMMs having from 2 to K_max components is fitted on W, and the best one, according to Akaike's Information Criterion (Bozdogan, 1987), is kept as the new GMM. In our experiments we use the same hyperparameters as in Portelas et al. (2019), except for the percentage of random task sampling p_rnd, which we tuned (we found our value to perform better than the original one), both when running ALP-GMM alone and when combining it with IN in the second phase of AGAIN. See algorithm 1 for pseudo-code and figure 2 for a schematic pipeline. Note that in this paper we refer to ALP as LP for simplicity (i.e. the LP values in eq. 1 correspond to the mean ALP of the Gaussians in ALP-GMM).
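A minimal Python sketch of the per-task ALP computation of eq. 2 is given below (illustrative names; rebuilding the KD-tree at every call is inefficient and done here only for clarity).

```python
import numpy as np
from scipy.spatial import cKDTree

def compute_alp(new_param, new_reward, history_params, history_rewards):
    """Per-task ALP of eq. 2 (sketch): compare the new episodic reward with the reward
    of the closest previously sampled task parameter (Euclidean distance)."""
    if len(history_params) == 0:
        return 0.0
    _, idx = cKDTree(np.asarray(history_params)).query(np.asarray(new_param))
    return abs(new_reward - history_rewards[idx])
```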

1: Student policy π, parametric procedural environment generator E, bounded parameter space P, probability of random sampling p_rnd, fitting rate N, max number of Gaussians K_max
2: Initialize π
3: Initialize parameter-ALP First-In-First-Out window W (fixed maximal size)
4: Initialize parameter-reward history database H
5: loop N times ▷ Bootstrap phase
6:     Sample random parameter p ∈ P, send p to E, observe episodic reward r
7:     Compute ALP of p based on r and H (see equation 2)
8:     Store (p, r) pair in H, store (p, ALP) pair in W
9: loop ▷ Stop when the experimental budget of environment steps is exhausted
10:     Fit a set of GMMs having 2 to K_max kernels on W
11:     Select the GMM with best Akaike Information Criterion
12:     loop N times
13:         p_rnd of the time, sample a random parameter p ∈ P
14:         Else, sample p from a Gaussian chosen proportionally to its mean ALP value
15:         Send p to student π and observe episodic reward r
16:         Compute ALP of p based on r and H
17:         Store (p, r) pair in H, store (p, ALP) pair in W
18: Return π
Algorithm 1   Absolute Learning Progress Gaussian Mixture Model (ALP-GMM)
Figure 2: Schematic view of an ALP-GMM teacher’s workflow from Portelas et al. (2019)

Appendix B Again

IN variants.

In order to filter the list of GMMs collected after a preliminary run of ALP-GMM (see eq. 1) into a curated list $\mathcal{G}'$ and use it as an expert curriculum, we remove any Gaussian whose LP is below a threshold $\delta_{LP}$ (the LP dimension is normalized, which requires choosing an approximate potential reward range, kept identical for all experiments). When all Gaussians of a GMM are discarded, the GMM is removed from the list. In practice, this removes non-informative GMMs corresponding to the initial exploration phase of ALP-GMM, when the learner has not made any progress (hence no LP is detected by the curriculum generator). The curated list is then iterated over to generate a curriculum with either the Time-based (see algo 2), Pool-based (see algo 3) or Reward-based (see algo 4) IN variant. IN-P does not require additional hyperparameters. IN-T requires an update rate at which to step through the list, which we set to the fitting rate N of ALP-GMM. IN-R requires extracting additional data from the first run, in the form of a list of mean episodic rewards:

$\mathcal{R} = [\, \bar{r}_1, \ldots, \bar{r}_T \,]$   (3)

with T the total number of GMMs in the first run (as in eq. 1), and $\bar{r}_i$ the mean episodic reward obtained by the first DRL agent over the last tasks sampled from the i-th GMM. A curated reward list is obtained by removing any $\bar{r}_i$ that corresponds to a GMM discarded while curating the GMM list. The remaining rewards are then used as thresholds in IN-R to decide when to switch to the next GMM of the curated list.
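A sketch of this curation step is given below (Gaussians as (mean, cov, lp) tuples, names illustrative); the reward thresholds of eq. 3 are filtered with the same mask so that they stay aligned with the curated GMM list.

```python
def curate(raw_gmms, raw_rewards, lp_threshold):
    """Drop Gaussians whose LP is below lp_threshold; drop GMMs left empty,
    together with their associated first-run mean episodic reward (eq. 3)."""
    curated_gmms, curated_rewards = [], []
    for gmm, reward in zip(raw_gmms, raw_rewards):   # gmm: list of (mean, cov, lp)
        kept = [g for g in gmm if g[2] >= lp_threshold]
        if kept:
            curated_gmms.append(kept)
            curated_rewards.append(reward)
    return curated_gmms, curated_rewards
```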

Again

In AGAIN (see algo. 5), the idea is to use both IN (R, T or P) and ALP-GMM (without the random bootstrapping period) for curriculum generation in the second run. We combine the changing GMMs of IN and ALP-GMM over time simply by building a composite GMM containing the Gaussians of the current GMM of IN and of the current GMM of ALP-GMM. By selecting the Gaussian from which to sample a new task proportionally to LP within this composite mixture, the approach adaptively modulates task sampling between both sources, shifting towards IN when ALP-GMM does not detect high-LP subspaces and towards ALP-GMM when the current GMM of IN has low-LP Gaussians. Additionally, to keep a minimal amount of task-exploration, which benefits ALP-GMM (allowing it to detect new progress niches), we sample random tasks with a low probability (lower than the one used for the preliminary ALP-GMM run).

1: Student policy π, task-encoding parameter space P, LP threshold δ_LP, update rate N, experimental budget of environment steps, random sampling ratio p_rnd, parametric procedural environment generator E
2: Init π, train it with ALP-GMM(p_rnd) for the first-run budget of env. steps and collect the GMM list (eq. 1) ▷ First run
3: Get the curated list by removing any Gaussian with LP below δ_LP.
4: Re-initialize π ▷ Second run
5: Initialize expert curriculum index i to 0
6: loop ▷ Stop after the second-run budget of environment steps
7:     Set i to min(i + 1, length of curated list)
8:     Set current GMM to the i-th GMM in the curated list
9:     loop N times
10:         Sample p from a Gaussian of the current GMM chosen proportionally to its LP
11:         Send p to student π
12: Return π
Algorithm 2   Inferred progress Niches - Time-based (IN-T)
1: Student policy π, task-encoding parameter space P, LP threshold δ_LP, experimental budget of environment steps, random sampling ratio p_rnd, parametric procedural environment generator E
2: Init π, train it with ALP-GMM(p_rnd) for the first-run budget of env. steps and collect the GMM list (eq. 1) ▷ First run
3: Get the curated list by removing any Gaussian with LP below δ_LP.
4: Re-initialize π ▷ Second run
5: Initialize pool GMM containing all Gaussians from the curated list
6: loop ▷ Stop after the second-run budget of environment steps
7:     Sample p from a Gaussian in the pool chosen proportionally to its LP
8:     Send p to student π
9: Return π
Algorithm 3   Inferred progress Niches - Pool-based (IN-P)
1: Student policy π, task-encoding parameter space P, LP threshold δ_LP, memory size m, experimental budget of environment steps, random sampling ratio p_rnd, parametric procedural environment generator E
2: Init π, train it with ALP-GMM(p_rnd) for the first-run budget of env. steps and collect the GMM list (eq. 1) ▷ First run
3: Get the curated list by removing any Gaussian with LP below δ_LP.
4: Additionally, collect the list of inferred reward thresholds and curate it accordingly ▷ See eq. 3
5: Re-initialize π ▷ Second run
6: Initialize reward First-In-First-Out window R_w, set max size to m
7: Initialize expert curriculum index i to 1
8: loop ▷ Stop after the second-run budget of environment steps
9:     If R_w is full, compute the mean reward over R_w
10:         If it is superior to the i-th reward threshold, set i to min(i + 1, length of curated list)
11:     Set current GMM to the i-th GMM in the curated list
12:     Sample p from a Gaussian of the current GMM chosen proportionally to its LP
13:     Send p to student π and add the episodic reward to R_w
14: Return π
Algorithm 4   Inferred progress Niches - Reward-based (IN-R)
1: Student policy π, task-encoding parameter space P, random sampling ratios p_rnd (first run) and p_low (second run), LP threshold δ_LP, experimental budget of environment steps, parametric procedural environment generator E
2: Init π, train it with ALP-GMM(p_rnd) for the first-run budget of env. steps and collect the GMM list (eq. 1) ▷ First run
3: Get the curated list by removing any Gaussian with LP below δ_LP
4: Re-initialize π ▷ Second run
5: Setup a new ALP-GMM(p_low) teacher ▷ See algo. 1
6: Setup either IN-T, IN-P or IN-R ▷ See algo. 2, 3 and 4
7: loop ▷ Stop after the second-run budget of environment steps
8:     Get the composite GMM from the current GMMs of both ALP-GMM and IN
9:     p_low of the time, sample a random parameter p ∈ P
10:     Else, sample p from a Gaussian of the composite GMM chosen proportionally to its LP
11:     Send p to student π and observe episodic reward r
12:     Send (p, r) pair to both ALP-GMM and IN
13: Return π
Algorithm 5   Alp-Gmm And Inferred progress Niches (AGAIN)

Appendix C Experimental details

Soft Actor-Critic

In our experiments, we use the implementation of Soft Actor-Critic provided by OpenAI (https://github.com/openai/spinningup). We use a two-hidden-layer fully-connected network for V, Q1, Q2 and the policy. Gradient steps are performed at a fixed interval of environment steps, with a fixed learning rate, batch size and entropy coefficient across all conditions.

Parametric BipedalWalker

Our proposed ACL variants choose the parameters of tasks that encode the procedural generation of walking tracks paved with stumps in the BipedalWalker environment. As in Portelas et al. (2019), we bound the stump height and stump spacing dimensions to fixed ranges (regardless of the walker morphology). The agent is rewarded for keeping its head straight and going forward, and is penalized for torque usage. The episode is terminated after 1) reaching the end of the track, 2) reaching a maximal number of steps, or 3) head collision (for which the agent receives a strong penalty). See figure 3 for visualizations.

Baselines

The Random curriculum baseline samples task parameters uniformly at random over the parameter space. The Oracle condition is a hand-made curriculum very similar to IN-R, except that the list of progress niches is built using expert knowledge, and all reward thresholds are set to a fixed episodic reward value often used in the literature as characterizing a default walker with a "reasonably efficient" walking gait (Wang et al., 2019). Basically, Oracle starts proposing tasks from a Gaussian located at the simplest subspace of the task space (i.e. low stump height and high stump spacing) and then gradually moves the Gaussian towards the hardest subspaces (high stump height and low stump spacing) by small increments, happening whenever the mean episodic reward of the DRL agent over the last proposed tasks is superior to the threshold. In our experiments, consistently with Portelas et al. (2019), which implements a similar approach, Oracle is prone to forgetting due to the strong shift in task subspace, which is why it is not the best performing condition in the default walker experiments (see Table 1).
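For completeness, a sketch of such a hand-made Oracle teacher is given below; the easy/hard corners, standard deviation, number of increments, window size and reward threshold are all experiment-specific values not restated here, and the class and argument names are illustrative.

```python
import numpy as np

class OracleCurriculum:
    """Sketch of the hand-made Oracle teacher: a single Gaussian that starts at the
    easiest corner of the task space and moves toward the hardest corner by a fixed
    increment whenever the mean reward over the last `window` tasks exceeds
    `reward_threshold` (all argument values are experiment-specific)."""

    def __init__(self, easy_corner, hard_corner, n_increments, std, reward_threshold, window):
        self.easy, self.hard = np.asarray(easy_corner), np.asarray(hard_corner)
        self.n_increments, self.std = n_increments, std
        self.reward_threshold, self.window = reward_threshold, window
        self.step, self.recent = 0, []

    def sample_task(self, rng=None):
        rng = rng or np.random.default_rng()
        mean = self.easy + (self.hard - self.easy) * self.step / self.n_increments
        return rng.normal(mean, self.std)             # Gaussian around the current mean

    def update(self, episodic_reward):
        self.recent.append(episodic_reward)
        if (len(self.recent) >= self.window
                and np.mean(self.recent[-self.window:]) > self.reward_threshold):
            self.step = min(self.step + 1, self.n_increments)
            self.recent = []
```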

Computational resources.

To perform our experiments, we ran each condition for either 10 (IN and AGAIN variants) or 20 (others) million environment steps (30 repeats) using one CPU and one GPU (the GPU being shared between runs), for approximately 30 hours of wall-clock time per run. The preliminary ALP-GMM runs used in the IN and AGAIN variants correspond to the first 10 million steps of the ALP-GMM condition (whose end-performance after 20 million steps is reported in Table 1).

Figure 3: Parameterized BipedalWalker environment. Left: Examples of generated tracks. Right: The two walker morphologies tested in the environment. One parameter tuple (stump height, stump spacing) actually encodes a distribution of tasks, as the height of each stump along the track is drawn from a distribution centered on the height parameter.

Appendix D Additional Visualizations

(a) with Pool-based IN
(b) with Time-based IN
(c) with Reward-based IN
Figure 4: Box plots of the final performance of each condition with the default bipedal walker after 20M environment steps. Gold lines are medians, surrounded by a box showing the first and third quartiles, followed by whiskers extending to the last datapoint within 1.5 times the inter-quartile range. Beyond the whiskers are outlier datapoints. From top to bottom, each box plot presents results when using either Pool-based, Reward-based, or Time-based IN, respectively.

(a) with Pool-based IN
(b) with Time-based IN
(c) with Reward-based IN
Figure 5: Box plots of the final performance of each condition with the short bipedal walker after 20M environment steps. Gold lines are medians, surrounded by a box showing the first and third quartiles, followed by whiskers extending to the last datapoint within 1.5 times the inter-quartile range. Beyond the whiskers are outlier datapoints. From top to bottom, each box plot presents results when using either Pool-based, Reward-based, or Time-based IN, respectively.
(a) Pool-based IN
(b) Time-based IN
(c) Reward-based IN
Figure 6: Evolution of performance across 20M environment steps of each condition with the default bipedal walker. Each point in each curve corresponds to the mean performance (30 seeds), defined as the percentage of mastered tracks (i.e. tracks on which the episodic reward exceeds a fixed threshold) on a fixed test set. Shaded areas represent the standard error of the mean.

(a) Pool-based IN
(b) Time-based IN
(c) Reward-based IN
Figure 7: Evolution of performance across 20M environment steps of each condition with the short bipedal walker. Each point in each curve corresponds to the mean performance (30 seeds), defined as the percentage of mastered tracks (i.e. tracks on which the episodic reward exceeds a fixed threshold) on a fixed test set. Shaded areas represent the standard error of the mean.