We consider the problem of how a teacher algorithm can enable an unknown Deep Reinforcement Learning (DRL) student to become good at a skill over a wide range of diverse environments. To do so, we study how a teacher algorithm can learn to generate a learning curriculum, whereby it sequentially samples parameters controlling a stochastic procedural generation of environments. Because it does not initially know the capacities of its student, a key challenge for the teacher is to discover which environments are easy, difficult or unlearnable, and in what order to propose them to maximize the efficiency of learning over the learnable ones. To achieve this, the problem is transformed into a surrogate continuous bandit problem where the teacher samples environments in order to maximize the absolute learning progress of its student. We present ALP-GMM, a new algorithm modeling absolute learning progress with Gaussian mixture models. We also adapt existing algorithms and provide a complete study in the context of DRL. Using parameterized variants of the BipedalWalker environment, we study their efficiency to personalize a learning curriculum for different learners (embodiments), their robustness to the ratio of learnable/unlearnable environments, and their scalability to non-linear and high-dimensional parameter spaces. Videos and code are available at https://github.com/flowersteam/teachDeepRL.
We address the strategic student problem. This problem is well known in the developmental robotics community [1], and formalizes a setting where an agent has to sequentially select tasks to train on so as to maximize its average competence over the whole set of tasks after a given number of interactions. To address this problem, several works [2, 3, 4] proposed to use automated Curriculum Learning (CL) strategies based on Learning Progress (LP) [5], and showed that population-based algorithms can benefit from such techniques. Inspired by these initial results, similar approaches [6] were then successfully applied to DRL agents in continuous control scenarios with discrete sets of goals, here defined as tasks varying only by their reward functions (e.g. reaching various target positions in a maze). Promising results were also observed when learning to navigate in discrete sets of environments, defined as tasks differing by their state space (e.g. escaping from a set of mazes) [7, 8].
In this paper, we study for the first time whether LP-based curriculum learning methods are able to scaffold generalist DRL agents in continuously parameterized environments. We compare the reuse of Robust Intelligent Adaptive Curiosity (RIAC) [9] in this new context to Absolute Learning Progress - Gaussian Mixture Model (ALP-GMM), a new GMM-based approach inspired by earlier work in developmental robotics [4] that is well suited for DRL agents. Both of these methods rely on Absolute Learning Progress (ALP) as a surrogate objective to optimize, with the aim of maximizing average competence over a given parameter space. Importantly, our approaches do not assume a direct mapping from parameters to environments: a given parameter vector encodes a distribution of environments with similar properties, which is closer to real-world scenarios where stochasticity is an issue.
Recent work [10] already showed impressive results in continuously parameterized environments. The POET approach proved itself to be capable of generating and mastering a large set of diverse BipedalWalker environments. However, their work differs from ours as they evolve a population of agents where each individual agent is specialized for a single specific deterministic environment, whereas we seek to scaffold the learning of a single generalist agent in a training regime where it never sees the same exact environment twice.
As our approaches make few assumptions, they can deal with ill-defined parameter spaces that include unfeasible subspaces and irrelevant parameter dimensions. This makes them particularly well suited to complex continuous parameter spaces for which expert knowledge is difficult to acquire. We formulate the Continuous Teacher-Student (CTS) framework to cover this scope of challenges, opening the range of potential applications.
Main contributions:
A Continuous Teacher-Student setup enabling to frame Teacher-Student interactions for ill-defined continuous parameter spaces encoding distributions of tasks. See Sec. 3.
Design of two parameterized BipedalWalker environments, well-suited to benchmark CL approaches on continuous parameter spaces encoding distributions of environments with procedural generation. See Sec. 4.3.
ALP-GMM, a CL approach based on Gaussian Mixture Models and absolute LP that is well suited for DRL agents learning continuously parameterized tasks. See Sec. 4.1.
First study of ALP-based teacher algorithms leveraged to scaffold the learning of generalist DRL agents in continuously parameterized environments. See Sec. 5.
Curriculum learning, as formulated in the supervised machine learning community, initially refers to techniques aimed at organizing labeled data to optimize the training of neural networks [11, 12, 13]. Concurrently, the RL community has been experimenting with transfer learning methods, providing ways to improve an agent on a target task by pre-training on an easier source task [14]. These two lines of work were combined and gave birth to curriculum learning for RL, that is, methods organizing the order in which tasks are presented to a learning agent so as to maximize its performance on one or several target tasks.

Learning progress has often been used as an intrinsically motivated objective to automate curriculum learning in developmental robotics [5], leading to successful applications in population-based robotic control in simulated [4, 15] and real-world environments [2, 3]. LP was also used to accelerate the training of LSTMs and Neural Turing Machines [16], and to personalize sequences of exercises for children in educational technologies [17].

A similar Teacher-Student framework was proposed in [7], which compared teacher approaches on a set of navigation tasks in Minecraft [18]. While their work focuses on discrete sets of environments, we tackle the broader challenge of dealing with continuous parameter spaces that map to distributions over environments, and in which large parts of the parameter space may be unlearnable.
Another form of CL has already been studied for continuous sets of tasks [19]; however, that work considered goals (tasks varying only by their reward function), whereas we tackle the more complex setting of learning to behave in a continuous set of environments (tasks varying by their state space). The Goal GAN algorithm also requires setting a reward range of "intermediate difficulty" to label each goal in order to train the GAN, which is highly dependent on both the learner's skills and the considered continuous set of goals. Besides, as the notion of intermediate difficulty provides no guarantee of progress, this approach is susceptible to focusing on unlearnable goals for which the learner's competence stagnates within the intermediate difficulty range.
In this section, we formalize our Continuous Teacher-Student (CTS) framework. It is inspired by earlier work in developmental robotics [9, 1] and intelligent tutoring systems [17]. The CTS framework is also close to earlier work on Teacher-Student approaches for discrete sets of tasks [7]. In CTS, however, teachers sample parameters mapping to distributions of tasks from a continuous parameter space. In the remainder of this paper, we will refer to parameter sampling and task distribution sampling interchangeably, as one parameter directly maps to a task distribution.
In CTS, learning agents, called students, are confronted with episodic Partially Observable Markov Decision Process (POMDP) tasks. For each interaction step t of an episode, a student collects an observation o_t, performs an action a_t, and receives a corresponding reward r_t. Upon task termination, an episodic reward r, the sum of the per-step rewards over the length of the episode, is computed.

The teacher interacts with its student by selecting a new parameter p from the continuous parameter space, mapped to a task distribution, proposing tasks sampled from this distribution to its student and observing the average r_p of the resulting episodic rewards. The new parameter-reward tuple (p, r_p) is then added to a history database of interactions that the teacher leverages to influence the parameter selection in order to maximize the student's final competence return across the parameter space. Formally, the objective is
max ∫_{p ∈ P} w_p · r_p^N dp    (1)

with N the predefined maximal number of teacher-student interactions, r_p^N the student's competence on the task distribution of parameter p after these N interactions, and w_p a factor weighting the relative importance of each task distribution in the optimization process, enabling to specify whether to focus on specific subregions of the parameter space (i.e. harder target tasks). As students are considered black-box learners, the teacher solely relies on its history database for parameter sampling and does not have access to information about its student's internal state, algorithm, or perceptual and motor capacities.
The teacher does not know the evolution of difficulty across the parameter space and therefore assumes a non-linear, piecewise-smooth competence landscape. The parameter space may also be ill-defined. For example, there might be subregions of the parameter space in which no competence improvement is possible on any parameter, given the state transition functions of the tasks sampled there (i.e. tasks are either trivial or unfeasible). Additionally, there might exist an equivalent parameter space of lower dimensionality, constructed from a subset of the dimensions of the original one, meaning that there might be irrelevant or redundant dimensions in the parameter space.
In the following sections, we will restrict our study to CTS setups in which teachers sample only one task per selected parameter vector (i.e. r_p is a single episodic reward) and do not prioritize the learning of specific subspaces (i.e. w_p is uniform).
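As a minimal sketch of this restricted CTS loop, the following Python code shows the interaction protocol. A uniform-random teacher stands in for the LP-based teachers introduced later, and the student is an arbitrary black-box callable; all names here are illustrative, not the paper's implementation.

```python
import random

class RandomTeacher:
    """Minimal CTS teacher: samples parameter vectors uniformly in a
    bounding box and stores (parameter, episodic reward) pairs in its
    history database."""
    def __init__(self, bounds, seed=0):
        self.bounds = bounds              # [(low, high), ...] per dimension
        self.history = []                 # the teacher's interaction database
        self.rng = random.Random(seed)

    def sample_parameter(self):
        return [self.rng.uniform(lo, hi) for lo, hi in self.bounds]

    def update(self, p, episodic_reward):
        self.history.append((p, episodic_reward))

def cts_loop(teacher, student_rollout, n_interactions):
    """One CTS run: the teacher proposes a parameter (i.e. a task
    distribution), observes only the student's episodic reward on it
    (black-box student), and records the pair."""
    for _ in range(n_interactions):
        p = teacher.sample_parameter()
        r = student_rollout(p)
        teacher.update(p, r)
    return teacher.history
```

Swapping `RandomTeacher` for an ALP-based teacher changes only `sample_parameter` and `update`; the loop itself is the whole Teacher-Student contract.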
In this section, we describe our absolute LP-based teacher algorithms and our reference teachers, and present the continuously parameterized BipedalWalker environments used to evaluate them.
Of central importance to this paper is the concept of learning progress, formulated as a theoretical hypothesis to account for intrinsically motivated learning in humans [20], and applied for efficient robot learning [5, 2]. Inspired by some of this work [3, 21], we frame our two teacher approaches as a Multi-Armed Bandit setup in which arms are dynamically mapped to subspaces of the parameter space, and whose values are defined by an absolute average LP utility function. The objective is then to select the subspaces on which to sample a distribution of tasks so as to maximize ALP. ALP gives a richer signal than (positive) LP, as it enables the teacher to detect when a student is losing competence on a previously mastered parameter subspace (thus preventing catastrophic forgetting).
RIAC [9] is a task sampling approach whose core idea is to split a given parameter space in hyperboxes (called regions) according to their absolute LP, defined as the difference of cumulative episodic reward between the newest and oldest tasks sampled in the region. Tasks are then sampled within regions selected proportionally to their ALP score. This approach can easily be translated to the problem of sampling distributions of tasks, as is the case in this work. To avoid a known tendency of RIAC to oversplit the space [19], we added a few minor modifications to the original architecture to constrain the splitting process. Details can be found in appendix B.
Another, more principled way of sampling tasks according to LP measures is to rely on the well-known Gaussian Mixture Model [22, 23] algorithms. This concept has already been successfully applied in the cognitive science field as a way to model intrinsic motivation in early vocal developments of infants [4]. In addition to testing their approach (referred to as Covar-GMM) for the first time on DRL students, we propose a variant based on an ALP measure capturing long-term progress variations that is well-suited for RL setups. See appendix B for a description of their method.

The key concept of ALP-GMM is to fit a GMM on a dataset of previously sampled parameters concatenated to their respective ALP measure. The Gaussian from which to sample a new parameter is then chosen using an EXP4 bandit scheme [24] where each Gaussian is viewed as an arm whose utility is its ALP. This enables the teacher to bias the parameter sampling towards high-ALP subspaces. To get this per-parameter ALP value, we take inspiration from earlier work in developmental robotics [21]: for each newly sampled parameter p_new and associated episodic reward r_new, the closest previously sampled parameter p_old (in Euclidean distance), with associated episodic reward r_old, is retrieved using a nearest neighbor algorithm (implemented with a KD-tree [25]). We then have
alp_new = |r_new − r_old|    (2)
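Under these definitions, Eq. (2) can be sketched as follows. The paper retrieves the nearest neighbour with a KD-tree; this illustrative version uses a brute-force scan for clarity.

```python
import math

def per_parameter_alp(p_new, r_new, history):
    """Per-parameter ALP of Eq. (2): retrieve the previously sampled
    parameter closest to p_new (Euclidean distance) together with its
    episodic reward r_old, and return |r_new - r_old|.
    `history` is a list of (parameter, episodic_reward) pairs."""
    if not history:
        return 0.0
    _, r_old = min(history, key=lambda pr: math.dist(p_new, pr[0]))
    return abs(r_new - r_old)
```

Note that the measure is symmetric in competence gains and losses, which is exactly what lets the teacher react to forgetting.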
The GMM is fit periodically on a window containing only the most recent parameter-ALP pairs, to bound its time complexity and make it more sensitive to recent high-ALP subspaces. The number of Gaussians is adapted online by fitting multiple GMMs with varying numbers of components and keeping the best one based on Akaike's Information Criterion [26]. Note that the nearest neighbor computation of per-parameter ALP uses a database that contains all previously sampled parameters and associated episodic rewards, which prevents any forgetting of long-term progress. In addition to its main task sampling strategy, ALP-GMM also samples a fraction of parameters at random to enable exploration. See Algorithm 1 for pseudocode and appendix B for a schematic view of ALP-GMM.
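A sketch of one ALP-GMM sampling step, assuming scikit-learn's `GaussianMixture` is available. Note that the paper's EXP4 bandit scheme is simplified here to sampling a Gaussian proportionally to the ALP coordinate of its mean; all other names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def alp_gmm_sample(params, alps, k_range=range(2, 6), seed=0):
    """One ALP-GMM sampling step (sketch): fit GMMs on concatenated
    [parameter, ALP] points for several numbers of components, keep the
    best model by AIC, pick a Gaussian with probability proportional to
    the ALP coordinate of its mean, and sample a new parameter from it
    (dropping the ALP dimension)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([params, alps])
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in k_range if k <= len(X)]
    best = min(models, key=lambda g: g.aic(X))
    # Arm utility = ALP coordinate (last dimension) of each Gaussian mean
    util = np.clip(best.means_[:, -1], 0, None) + 1e-6
    idx = rng.choice(len(util), p=util / util.sum())
    mean, cov = best.means_[idx][:-1], best.covariances_[idx][:-1, :-1]
    return rng.multivariate_normal(mean, cov)
```

Because ALP is treated as an extra coordinate, Gaussians naturally cluster regions of the parameter space with similar progress, and the utility-weighted choice biases sampling towards the currently most promising cluster.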
In Random, parameters are sampled randomly in the parameter space for each new episode. Although simplistic, similar approaches in previous work [3] proved to be competitive against more elaborate forms of CL.
Oracle. A hand-constructed approach, sampling random task distributions in a fixed-size sliding window over the parameter space. This window is initially set to the easiest area of the parameter space and is then slowly moved towards more complex ones, with difficulty increments only happening if a minimum average performance is reached. Expert knowledge is used to find the dimensions of the window, the amplitude and direction of increments, and the average performance threshold. Pseudocode is available in Appendix B.
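The sliding-window mechanism can be sketched as follows. Window size, step and threshold are the expert-set quantities mentioned above; they are passed in as arguments rather than fixed to the paper's (unstated) values.

```python
import random

def oracle_step(window_low, window_size, step, bounds,
                recent_rewards, threshold):
    """One step of the hand-made Oracle curriculum (sketch): sample a
    parameter uniformly inside a fixed-size window, sliding the window
    towards harder regions by `step` whenever the student's recent mean
    episodic reward exceeds `threshold`. The window is clipped so it
    never leaves the parameter space `bounds`."""
    if recent_rewards and sum(recent_rewards) / len(recent_rewards) >= threshold:
        window_low = [min(lo + s, hi - w)
                      for lo, s, (_, hi), w in zip(window_low, step,
                                                   bounds, window_size)]
    p = [random.uniform(lo, lo + w) for lo, w in zip(window_low, window_size)]
    return p, window_low
```

The design choice that matters here (and explains the forgetting issue discussed in the results) is that the window only ever moves forward: once it has slid towards hard regions, easy task distributions are never sampled again.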
The BipedalWalker environment [27] offers a convenient testbed for continuous control, allowing to easily build parametric variations of the original version [28, 10]. The learning agent, embodied in a bipedal walker, receives positive rewards for moving forward and penalties for torque usage and angular head movements. Agents are allowed a fixed number of steps to reach the other side of the map. Episodes are aborted with a reward penalty if the walker's head touches an obstacle.
To study the ability of our teachers to guide DRL students, we design two continuously parameterized BipedalWalker environments enabling the procedural generation of walking tracks:
Stump Tracks. A 2-D parametric environment producing tracks paved with stumps varying by their height and spacing. Given a parameter vector encoding a mean stump height and a stump spacing, a track is constructed by generating stumps separated by the given spacing and whose heights are independent samples from a normal distribution centered on the given height.

Hexagon Tracks. A more challenging 12-D parametric BipedalWalker environment. Given a vector of offset values, each track is constructed by generating hexagons whose default vertices' positions are perturbed by strictly positive independent samples based on these offsets. The remaining parameters are distractors defining the color of each hexagon. This environment is challenging as there are no subspaces generating trivial tracks with 0-height obstacles (since offsets to the default hexagon shape are positive). This parameter space also has non-linear difficulty gradients, as each vertex has a different impact on difficulty when modified.
All of the experiments in these environments were performed using OpenAI's implementation of Soft Actor-Critic [29] as the single student algorithm. To test our teachers' robustness to students with varying abilities, we use different walker morphologies (see Figure 3). Additional details on these two environments, along with track examples, are available in Appendix E.

To assess the performance of all of our approaches on our BipedalWalker environments, we define a binary competence return measure stating whether a given track distribution is mastered or not, depending on the student's episodic reward. We set the reward threshold to the value used in [10] to ensure "reasonably efficient" walking gaits for default bipedal walkers trained on environments similar to ours. Note that this reward threshold is only used for evaluation purposes and in the Oracle condition. Performance is then evaluated periodically by sampling a single track in each track distribution of a fixed evaluation set of distributions sampled uniformly in the parameter space. We then simply measure the percentage of mastered tracks. During evaluation, learning in DRL agents is turned off.
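The evaluation metric reduces to a simple percentage computation; in the sketch below, the `reward_threshold` argument stands for the expert-set value taken from [10].

```python
def percent_mastered(episodic_rewards, reward_threshold):
    """Evaluation metric (sketch): a track distribution counts as
    mastered iff the student's episodic reward on a track sampled from
    it exceeds the threshold. Returns the percentage of mastered
    distributions in the fixed evaluation set."""
    mastered = sum(r > reward_threshold for r in episodic_rewards)
    return 100.0 * mastered / len(episodic_rewards)
```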
Through our experiments, we answer three questions about ALP-GMM, Covar-GMM and RIAC:

Are ALP-GMM, Covar-GMM and RIAC able to optimize their students' performance better than random approaches and teachers exploiting environment knowledge?
How does their performance scale when the proportion of unfeasible tasks increases?
Are they able to scale to high-dimensional sampling spaces with irrelevant dimensions?
Figure 8 provides a visualization of the sampling trajectory observed in a representative ALP-GMM run for a default walker. Each plot shows the location of each Gaussian of the current mixture along with the track distributions subsequently sampled. At first (a), the walker does not manage to make any significant progress. After more episodes (b), the student starts making progress on the leftmost part of the parameter space, especially for track distributions with larger spacing, which leads ALP-GMM to focus its sampling in that direction. Later in training (c), ALP-GMM has shifted its sampling strategy to more complex regions. The analysis of a typical RIAC run is detailed in Appendix C.
Figure 12 shows learning curves for each condition paired with short, default and quadrupedal walkers. First of all, for short agents (a), one can see that Oracle is the best performing algorithm. This is an expected result, as Oracle knows where to sample simple track distributions, which is crucial when most of the parameter space is unfeasible, as is the case with short agents. ALP-GMM is the LP-based teacher with the highest final mean performance. This performance advantage for ALP-GMM is statistically significant when compared to RIAC (Welch's t-test at the end of training); however, there is no statistically significant difference with Covar-GMM. All LP-based teachers are significantly superior to Random.

Regarding default bipedal walkers (b), our hand-made curriculum (Oracle) performs better than other approaches early in training and then rapidly decreases, ending up with a performance comparable to RIAC and Covar-GMM. All LP-based conditions end up with a final mean performance statistically superior to Random. ALP-GMM is the highest performing algorithm, significantly superior to Oracle, RIAC and Covar-GMM.

For quadrupedal walkers (c), Random, ALP-GMM, Covar-GMM and RIAC agents quickly learn to master most of the test set, without significant differences apart from Covar-GMM being superior to RIAC. This indicates that, for this agent type, the parameter space of Stump Tracks is simple enough that trying random tracks for each new episode is a sufficient curriculum learning strategy. Oracle teachers perform significantly worse than any other method.
The mean performance (32 seeded runs) is plotted with shaded areas representing the standard error of the mean.
Through this analysis, we answered our first experimental question by showing how ALP-GMM, Covar-GMM and RIAC, without strong assumptions on the environment, managed to scaffold the learning of multiple students better than Random. Interestingly, ALP-GMM outperformed Oracle with default agents, and RIAC, Covar-GMM and ALP-GMM surpassed Oracle with the quadrupedal agent, despite its advantageous use of domain knowledge. This indicates that training only on track distributions sampled from a sliding window that ends up on the most difficult parameter subspace leads to forgetting of simpler task distributions. Our approaches avoid this issue through efficient tracking of their students' learning progress.
A crucial requirement when designing all-purpose teacher algorithms is to ensure their ability to deal with parameter spaces that are ill-defined w.r.t. the considered student. To study this property, we performed additional experiments on Stump Tracks in which we gradually increased the range of the stump height dimension, which increases the proportion of unfeasible tracks.
Results are summarized in Table 1. To assess whether a condition is robust to increasing unfeasibility, one can look at the p-value of the Welch's t-test performed on the final performance measure between the condition run on the original parameter space and the same condition run on a wider space. A high p-value indicates that there is not enough evidence to reject the null hypothesis of no difference, which can be interpreted as being robust to parameter spaces containing more unfeasible tasks. Using this metric, it is clear that ALP-GMM is the most robust condition among the presented LP-based teachers when increasing the stump height range, compared to RIAC and Covar-GMM. When moving to the widest range, ALP-GMM is the only LP-based teacher able to maintain most of its performance. Although Random also seems to show robustness to increasingly unfeasible parameter spaces, this is most likely due to its stagnation at low performance. Compared to all other approaches, ALP-GMM remains the highest performing condition in both parameter space variations.

Cond. \ Stump height

ALP-GMM  ()  ()
Covar-GMM  ()  ()
RIAC  ()  ()
Random  ()  ()
The average performance with standard deviation (after 20 Million steps) on the original Stump Tracks' test set is reported (32 seeds per condition). The additional p-values indicate whether conditions run in the original Stump Tracks are significantly better than when run on variations with a higher maximal stump height.

These additional experiments on Stump Tracks showed that our LP-based teachers are able to partially maintain the performance of their students in parameter spaces with higher proportions of unfeasible tasks, with a significant advantage for ALP-GMM.
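The robustness analysis above relies on Welch's (unequal-variance) t-test. A self-contained sketch of the statistic and its degrees of freedom follows; in practice one would call `scipy.stats.ttest_ind(a, b, equal_var=False)` to obtain the p-value directly.

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with (possibly) unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)   # unbiased variances
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, dof
```

Welch's variant is the appropriate choice here because the compared conditions (e.g. 32 seeds each on two parameter-space variants) have no reason to share a common variance.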
To assess whether ALP-GMM, Covar-GMM and RIAC are able to scale to parameter spaces of higher dimensionality containing irrelevant dimensions, and whose difficulty gradients are non-linear, we performed experiments with quadrupedal walkers on Hexagon Tracks, our 12-dimensional parametric BipedalWalker environment. Results are shown in Figure 15. Early in training, one can see that Oracle has a large performance advantage compared to LP-based teachers, which is mainly due to its knowledge of initial progress niches. However, by the end of training, ALP-GMM significantly outperforms Oracle. Compared to Covar-GMM and RIAC, the final performance of ALP-GMM is also significantly superior, while being more robust and having less variance (see appendix D). All LP-based approaches are significantly better than Random.

Experiments on Hexagon Tracks showed that ALP-GMM is the most suitable condition for complex high-dimensional environments containing irrelevant dimensions, non-linear parameter spaces and large proportions of initially unfeasible tasks.
To better grasp the general properties of our teacher algorithms, additional abstract experiments without DRL students were also performed on parameter spaces with increasing numbers of dimensions (relevant and irrelevant) and an increasing ratio of initially unfeasible subspaces, showing that GMM-based approaches performed best (see Appendix A).
This work demonstrated that LP-based teacher algorithms can successfully guide DRL agents to learn in difficult continuously parameterized environments with irrelevant dimensions and large proportions of unfeasible tasks. With no prior knowledge of its student's abilities and only loose boundaries on the task space, ALP-GMM, our proposed teacher, consistently outperformed random heuristics and occasionally even expert-designed curricula.
ALP-GMM, which is conceptually simple and has very few crucial hyperparameters, opens up exciting perspectives inside and outside DRL for curriculum learning problems. Within DRL, it could be applied to previous work on autonomous goal exploration through incremental building of goal spaces [30]. In this case, several ALP-GMM instances could scaffold the learning agent in each of its autonomously discovered goal spaces. Another domain of applicability is assisted education, for which the current state of the art relies heavily on expert knowledge [17] and is mostly applied to discrete task sets.

This work was supported by Microsoft Research through its PhD Scholarship Programme. Experiments were carried out using the PlaFRIM experimental testbed, supported by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d'Aquitaine (see https://www.plafrim.fr/) and the Curta platform (see https://redmine.mcia.fr/).
An n-dimensional toy parameter space was implemented to simulate a student learning process, enabling the study of our teachers in a controlled deterministic environment without DRL agents. A parameter directly maps to an episodic reward depending on the history of previously sampled parameters. The parameter space is divided into hypercubes and enforces the following rules:
Sampling a parameter in an "unlocked" hypercube yields a positive reward that grows with the number of parameters already sampled in that hypercube, up to a maximal value. Sampling a parameter located in a "locked" hypercube does not yield any reward.
At first, all hypercubes are "locked" except for one, located in a corner.
Sampling enough parameters in an unlocked hypercube unlocks its neighboring hypercubes.
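These rules can be sketched as a small simulator. The exact reward range and unlock count are not reproduced above, so the values below (reward capped at 100, neighbours unlocked after 50 samples) are illustrative assumptions.

```python
class ToySpace:
    """Sketch of the abstract toy parameter space: the unit hypercube is
    divided into cells; only 'unlocked' cells yield reward, which grows
    with the number of parameters already sampled in the cell, and a
    sufficiently sampled cell unlocks its axis-aligned neighbours."""
    def __init__(self, n_dims=2, cells_per_dim=10, unlock_after=50):
        self.n_dims, self.cells = n_dims, cells_per_dim
        self.unlock_after = unlock_after
        self.counts = {}                      # samples collected per cell
        self.unlocked = {(0,) * n_dims}       # one corner cell starts unlocked

    def _cell(self, p):
        # Map a parameter in [0, 1]^n to its hypercube index
        return tuple(min(int(x * self.cells), self.cells - 1) for x in p)

    def episodic_reward(self, p):
        c = self._cell(p)
        if c not in self.unlocked:
            return 0.0                        # locked cell: no reward
        n = self.counts.get(c, 0) + 1
        self.counts[c] = n
        if n >= self.unlock_after:            # unlock axis-aligned neighbours
            for d in range(self.n_dims):
                for delta in (-1, 1):
                    nb = list(c)
                    nb[d] += delta
                    if 0 <= nb[d] < self.cells:
                        self.unlocked.add(tuple(nb))
        return min(100.0, 100.0 * n / self.unlock_after)
```

This makes the intended dynamic explicit: reward only grows where a teacher concentrates sampling, and progress in one cell opens up new learnable subspaces next to it.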
Results are displayed in Figure 26. We use the median percentage of unlocked hypercubes as a performance metric. A first experiment was performed on a reference toy space. In this experiment, one can see that all LP-based approaches outperform Random by a significant margin, with Covar-GMM being the highest performing algorithm. This first toy space is used as a point of reference for the following analysis, in which all conditions were tested on a panel of toy spaces with varying numbers of meaningful dimensions (first row of Figure 26), irrelevant dimensions (second row) and numbers of hypercubes (third row).

By looking at the first row of Figure 26, one can see that increasing the number of dimensions has a greater negative impact on RIAC than on GMM-based approaches: RIAC, which was between ALP-GMM and Covar-GMM in terms of median performance in the reference experiment, now clearly underperforms them on all toy spaces. In the larger toy spaces, RIAC is even outperformed by the Random condition late in training, while in the smallest one it consistently outperforms Random. Covar-GMM is the highest performing condition in each toy space, closely followed by ALP-GMM.
The second row of Figure 26 shows how the performance of our approaches varies when adding irrelevant dimensions to the toy space. As expected, Random is not affected by these additional dimensions. With increasing numbers of irrelevant dimensions, RIAC is consistently inferior to the GMM-based conditions in terms of median performance, only exceeding Random early in training. ALP-GMM is the highest performing algorithm throughout training for most of these toy spaces, closely followed by Covar-GMM; in the toy space with the most irrelevant dimensions, ALP-GMM outperforms Covar-GMM early in training, but Covar-GMM is the first to reach full median performance.

The last row shows how performance changes according to the number of hypercubes. Given our toy space rules, increasing the number of hypercubes reduces the initial area where reward is obtainable in the parameter space, and therefore allows us to study the sensitivity of our approaches in detecting learnable subspaces. Random struggles in all of these toy spaces compared to other conditions and to its own performance in the reference experiment. Covar-GMM and RIAC are the best performing conditions for the smallest number of hypercubes per dimension. However, when the number of hypercubes per dimension increases, Covar-GMM remains the best performing condition while RIAC falls behind ALP-GMM.
Overall, these experiments showed that GMM-based approaches scaled better than RIAC to parameter spaces with large numbers of relevant or irrelevant dimensions and large proportions of (initially) unfeasible subspaces. Among the GMM-based approaches, and contrary to the experiments with DRL students on BipedalWalker environments, Covar-GMM proved better than ALP-GMM in these toy spaces.
All of our experiments were performed with OpenAI's implementation of SAC (https://github.com/openai/spinningup) as our DRL student. We used the same 2-layered (400, 300) network architecture with ReLU activations for the Q, V and policy networks. The policy network's output uses tanh activations. The entropy coefficient and learning rate were kept fixed. Gradient steps are performed periodically by selecting samples from a fixed-size replay buffer.

To avoid a known tendency of RIAC to over-split the parameter space [19], we added a few modifications to the original architecture. The essence of our modified RIAC can be summarized as follows:
When collecting a new parameter-reward pair, it is added to its respective region. If this region reaches its maximal capacity, a split attempt is performed.
When trying to split a parent region into two child regions, several candidate splits over random dimensions and thresholds are generated. Splits resulting in one of the child regions having too few individuals are rejected. Likewise, to avoid having extremely small regions, a minimum size is enforced for each region's dimensions (set to a fraction of the initial range of each dimension of the parameter space). The split with the highest score is then kept. If no valid split is found, the region flushes its oldest points (the oldest quarter of the pairs sampled in the region are removed).
At sampling time, several strategies are combined:
: a random parameter is chosen in the entire space.
: a region is selected proportionally to its ALP and a random parameter is sampled within the region.
: a region is selected proportionally to its ALP and the parameter with the lowest associated episodic reward in that region is slightly mutated (by adding Gaussian noise ).
We refer the reader to the original RIAC papers [9, 3] for detailed motivations and pseudocode descriptions.
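A minimal Python sketch of the region-splitting step described above is given below. The capacity, candidate-count and minimum-size values are placeholders for the elided settings, and the split score used here (product of the child sizes and their absolute ALP difference) follows the original SAGG-RIAC formulation as an assumption, not a value taken from this text:

```python
import random

def try_split(history, bounds, init_ranges, n_candidates=50,
              min_children=20, min_dim_frac=0.1):
    """Attempt to split a full region's (parameter, reward) history in two.

    Returns (dim, threshold, child_a, child_b) for the best valid split,
    or None if no candidate satisfies the constraints. All hyperparameter
    values are placeholders; the split score is a SAGG-RIAC-style product
    of child sizes and absolute ALP difference (an assumption).
    """
    def alp(pairs):
        # Absolute learning progress: |mean reward of newer half - older half|.
        rewards = [r for _, r in pairs]
        mid = len(rewards) // 2
        if mid == 0:
            return 0.0
        old, new = rewards[:mid], rewards[mid:]
        return abs(sum(new) / len(new) - sum(old) / len(old))

    best, best_score = None, -1.0
    for _ in range(n_candidates):
        dim = random.randrange(len(bounds))
        low, high = bounds[dim]
        margin = min_dim_frac * init_ranges[dim]  # enforce a minimum region size
        if high - low <= 2 * margin:
            continue
        threshold = random.uniform(low + margin, high - margin)
        child_a = [(p, r) for p, r in history if p[dim] < threshold]
        child_b = [(p, r) for p, r in history if p[dim] >= threshold]
        if len(child_a) < min_children or len(child_b) < min_children:
            continue  # reject splits with too few individuals per child
        score = len(child_a) * len(child_b) * abs(alp(child_a) - alp(child_b))
        if score > best_score:
            best, best_score = (dim, threshold, child_a, child_b), score
    return best
```

In this sketch, a rejected split attempt would then trigger the flushing of the region's oldest quarter of points, as described above.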
Originating from the developmental robotics field [4], this approach inspired the design of ALP-GMM. In Covar-GMM, instead of fitting a GMM on the parameter space concatenated with ALP as in ALP-GMM, each parameter is concatenated with its associated episodic return and time (relative to the current window of considered parameters). New parameters are then chosen by sampling from a Gaussian selected proportionally to its positive covariance between time and episodic reward, which emulates positive LP. Contrary to ALP-GMM, Covar-GMM ignores negative learning progress and has no way to detect long-term LP (i.e. LP is only measured over the currently fitted datapoints). Although not initially present in Covar-GMM, we compute the number of Gaussians online as in ALP-GMM to compare the two approaches solely on their LP measure. Likewise, Covar-GMM is given the same hyperparameters as ALP-GMM (see Section 4).
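The Gaussian-selection step of Covar-GMM can be sketched as follows. The function name and the [parameters..., reward, time] vector layout are illustrative assumptions, and the GMM fitting itself (e.g. with scikit-learn's `GaussianMixture`) is omitted; only the component covariance matrices are assumed given:

```python
import numpy as np

def select_gaussian(covariances, time_idx=-1, reward_idx=-2, rng=None):
    """Pick a fitted GMM component with probability proportional to its
    positive time/reward covariance (a proxy for positive LP).

    `covariances` holds one covariance matrix per component, over vectors
    laid out as [parameters..., episodic_reward, time] (an assumption).
    """
    rng = rng or np.random.default_rng()
    # Clip negative time/reward covariances to zero: negative LP is ignored.
    lp = np.array([max(0.0, cov[time_idx, reward_idx]) for cov in covariances])
    if lp.sum() == 0.0:
        # No component shows positive progress: fall back to uniform sampling.
        probs = np.ones(len(covariances)) / len(covariances)
    else:
        probs = lp / lp.sum()
    return rng.choice(len(covariances), p=probs)
```

A new parameter would then be sampled from the selected component's Gaussian, restricted to the parameter dimensions.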
Oracle has been manually crafted based on knowledge acquired over multiple runs of the algorithm. It uses a step-size vector containing the maximal sliding distance for each dimension of the parameter space. Before each new episode, the sampling window () is slid toward more complex task distributions by this step size, but only if the average episodic reward over the last proposed tasks is above . See Algorithm 2 for pseudocode.
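A minimal sketch of this sliding-window rule, assuming uniform sampling inside the window and using placeholder values for the elided reward threshold and history length:

```python
import random

def oracle_step(window_low, window_high, recent_rewards, delta, space_high,
                reward_threshold=230.0, n_recent=50):
    """One Oracle update: slide the sampling window toward harder tasks by
    `delta` when the student's recent average reward exceeds a threshold.

    `reward_threshold` and `n_recent` are placeholder values (elided in the
    text); `space_high` caps the window at the parameter-space boundary.
    Returns the (possibly updated) window and the next sampled task.
    """
    if len(recent_rewards) >= n_recent and \
            sum(recent_rewards[-n_recent:]) / n_recent > reward_threshold:
        # Student mastered the current window: shift it toward harder tasks.
        window_low = [min(l + d, h) for l, d, h in zip(window_low, delta, space_high)]
        window_high = [min(u + d, h) for u, d, h in zip(window_high, delta, space_high)]
    # Sample the next task uniformly inside the current window.
    task = [random.uniform(l, u) for l, u in zip(window_low, window_high)]
    return window_low, window_high, task
```

With this rule, a student that plateaus below the threshold keeps sampling from the same window, which matches the end-of-training behavior described below.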
Figure 33 provides a visualization of the evolution of Oracle's parameter sampling for a typical run in Stump Tracks. One can see that the final position of the parameter sampling window (corresponding to a subspace that cannot be mastered by the student) is reached after episodes (c) and remains the same until the end of the run, which totals episodes. This end-of-training focus on a specific part of the parameter space is the cause of Oracle's forgetting issues (see Section 5.1).
Gold lines are medians, surrounded by a box showing the first and third quartiles, followed by whiskers extending to the last datapoint within times the interquartile range. Beyond the whiskers are outlier datapoints.
(a): For short agents, Random always ends up mastering 0% of the track distributions of the test set, except for a single run that masters 3 track distributions (6%). LP-based teachers obtained better performance than Random, while still failing to reach non-zero performance by the end of training in runs for ALP-GMM, for Covar-GMM and for RIAC.
To better understand the properties of all tested conditions in Hexagon Tracks, we analyzed the distribution of the percentage of mastered environments of the test set after training for Million environment steps. Using Figure 48, one can see that ALP-GMM has both the highest median performance and the narrowest distribution. Out of the repeats, only Oracle and ALP-GMM always end up with positive final performance scores, whereas Covar-GMM, RIAC and Random end up with zero performance in , and runs, respectively. Interestingly, in all repeats of every condition, the student manages to master part of the test set at some point (i.e. non-zero performance), meaning that runs ending with zero final test performance actually experienced catastrophic forgetting. This showcases the ability of ALP-GMM to avoid this forgetting issue through efficient tracking of its student's absolute learning progress.
In BipedalWalker environments, the observation vectors provided to walkers are composed of 10 lidar sensors (providing distance measures), the hull angle and velocities (linear and angular), the angle and speed of each hip and knee joint, along with a binary vector informing whether each leg is touching the ground. This sums up to dimensions for our two bipedal walkers and for the quadrupedal version. To account for its increased weight and additional legs, we increased the maximal torque usage and reduced the torque penalty for quadrupedal agents.
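Assuming the hull contributes its angle, angular velocity and two linear velocity components, and each leg contributes an angle/speed pair per hip and knee joint plus one contact bit, the observation sizes can be tallied as follows (these per-component counts are inferred from the description above and should be checked against the environment code):

```python
def obs_dim(n_legs, n_lidars=10):
    """Tally the walker observation size from the components listed above.

    The per-component counts are assumptions inferred from the text: hull
    angle, angular velocity and two linear velocities; an (angle, speed)
    pair per hip and knee joint; one ground-contact bit per leg.
    """
    hull = 1 + 1 + 2        # hull angle, angular velocity, vx, vy
    per_leg = 2 * 2 + 1     # (angle, speed) for hip and knee + contact bit
    return n_lidars + hull + n_legs * per_leg

print(obs_dim(n_legs=2))  # bipedal walker  -> 24
print(obs_dim(n_legs=4))  # quadrupedal walker -> 34
```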
In Stump Tracks, the range of the mean stump height is set to , while the spacing range lies in . Examples of randomly generated tracks are shown in Figure 54.
In Hexagon Tracks, the ranges of the dimensions of the space are set to . Figure 56 provides a visual explanation of how hexagons are generated, and Figure 74 shows examples of randomly generated tracks.