The idea of organizing the learning sequence of a machine is an old concept that stems from multiple works in reinforcement learning (Selfridge et al., 1985; Schmidhuber, 1991), developmental robotics (Oudeyer et al., 2007) and supervised learning (Elman, 1993; Bengio et al., 2009), from which the Deep RL community borrowed the term Curriculum Learning (CL). Automatic CL (ACL) refers to approaches able to autonomously adapt their task sampling distribution to their evolving learner with minimal expert knowledge. Several ACL approaches have recently been proposed (Florensa et al., 2018; Racanière et al., 2019; OpenAI et al., 2019; Portelas et al., 2019; Colas et al., 2019; Pong et al., 2019; Jabri et al., 2019; Laversanne-Finot et al., 2018).
ACL and exploration.
One limit of ACL approaches is that, when applied to a large parameterized task space with few learnable subspaces, as when considering a rich procedural generation system, they lose a lot of time finding the "optimal parameters" at a given point in time (e.g. the niches of progress in Learning Progress-based approaches) through task exploration. We also hypothesize that these additional exploratory tasks presented to the DRL learner have a cluttering effect on the gathered training data, which adds noise to its already brittle gradient-based optimization and leads to sub-optimal performance.
Given this hypothesized drawback of task exploration, we propose to study whether ACL techniques could be improved by a two-stage approach consisting of 1) a preliminary run with ACL from which prior knowledge of the task space is extracted, and 2) a second, independent run leveraging this prior knowledge to propose a better curriculum to the DRL agent. For its simplicity and versatility, we choose to develop such an approach on top of ALP-GMM (Portelas et al., 2019), a recent ACL algorithm for continuous task spaces.
Within DRL, Policy Distillation (Czarnecki et al., 2019) consists of leveraging a previously trained "teacher" policy to perform behavior cloning: a "student" policy is trained to jointly maximize its reward on one or several tasks while minimizing the distance between its action distribution and the teacher's. This makes it possible to speed up the learning of bigger architectures and/or to leverage task experts to train a single learner on a set of tasks. From this point of view, our work can be seen as a complementary approach interested in how to perform Curriculum Distillation when considering a continuous space of tasks.
Similar ideas were developed for supervised learning by Hacohen and Weinshall (2019). In their work, the authors propose an approach to infer a curriculum from past training for an image classification task: a first network is trained without a curriculum, and its predictive confidence on each image is used as a difficulty measure from which a subsequently trained network derives its curriculum. The idea that a knowledge distillation procedure can be beneficial even when the teacher and student policies have identical architectures has also been studied in supervised learning (Furlanello et al., 2018; Yim et al., 2017). In this work, we propose to extend these concepts to DRL scenarios.
ALP-GMM (Portelas et al., 2019) is a Learning Progress (LP) based ACL technique for continuous task spaces that assumes no prior knowledge of the task space. It is inspired by previous works in developmental robotics (Baranes and Oudeyer, 2009; Moulin-Frier et al., 2014). ALP-GMM frames the task sampling problem as an EXP-4 non-stationary Multi-Armed Bandit setup (Auer et al., 2002) in which arms are Gaussians spanning the space of task parameters, whose utility is defined by a local LP measure. The essence of ALP-GMM is to periodically fit a Gaussian Mixture Model (GMM) on recently sampled task parameters concatenated with their respective LP. The Gaussian from which to sample a new task is then chosen proportionally to its mean LP dimension. Task exploration happens initially through a bootstrapping period of random task sampling, and during training through occasional random task sampling with a small residual probability.
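As an illustration, the LP-proportional sampling step just described can be sketched as follows. This is a minimal sketch, assuming a 1-D task space and a hypothetical list-of-dicts encoding of the GMM; the periodic EM fitting step is elided, and all names are ours rather than those of the original implementation.

```python
import random

def sample_task(gmm, p_rnd=0.1, bounds=(0.0, 1.0), rng=random):
    """With probability p_rnd, sample a random task (exploration); otherwise
    pick a Gaussian proportionally to its mean LP and sample from it."""
    if rng.random() < p_rnd:
        return rng.uniform(*bounds)
    total_lp = sum(g["lp"] for g in gmm) or 1.0
    r, acc = rng.uniform(0.0, total_lp), 0.0
    for g in gmm:
        acc += g["lp"]
        if r <= acc:
            # clip the sampled parameter to the task space boundaries
            return min(max(rng.gauss(g["mean"], g["std"]), bounds[0]), bounds[1])
    return gmm[-1]["mean"]
```

With exploration disabled, sampling concentrates on the high-LP Gaussian, which is the intended "niche of progress" behavior.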
Inferred progress Niches (IN)
Using ALP-GMM is convenient for our target experiments, as deriving an expert curriculum from an initial run is straightforward: one simply needs to gather the sequence of GMMs that were periodically fitted along training:

G = [ {(LP_k, μ_k, Σ_k)}_{k=1..K_i} ]_{i=1..T},   (1)

with T the total number of GMMs in the list, K_i their respective number of components, and LP_k the Learning Progress of each Gaussian. By keeping only Gaussians whose LP_k is above a predefined threshold, we get a curated list G'. We name the resulting ACL approach Inferred progress Niches (IN) and propose three variants to select which GMM from G' is used to sample tasks over time during the second run:
Pool-based (IN-P) A rather crude approach is to disregard the ordering of the curated list and consider the entire trajectory of GMMs as one single pool of Gaussians, i.e. one big mixture containing all retained components.
Time-based (IN-T) In this version, the curated list is stepped through periodically, at the same rate as the GMM fitting rate of the preliminary ALP-GMM run (i.e. a new GMM every fixed number of episodes in our experiments).
Reward-based (IN-R) Another option is to iterate through the curated list by advancing to the next GMM only once the mean episodic reward over tasks recently sampled from the current GMM matches or surpasses the mean episodic reward recorded during the initial run (on the same GMM).
Regardless of the selection process, given a GMM, a new task is selected by sampling a task parameter from a Gaussian chosen proportionally to its LP value.
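The curation and pooling steps can be sketched as follows, again assuming an illustrative list-of-dicts encoding of the GMM sequence gathered from the preliminary run (names and threshold value are ours):

```python
def curate(gmm_list, lp_threshold=0.1):
    """Keep only Gaussians whose LP exceeds the threshold; GMMs left empty
    (e.g. from the initial no-progress phase) are dropped entirely."""
    curated = []
    for gmm in gmm_list:
        kept = [g for g in gmm if g["lp"] > lp_threshold]
        if kept:
            curated.append(kept)
    return curated

def pool(curated):
    """IN-P: flatten the curated sequence into one big mixture of Gaussians."""
    return [g for gmm in curated for g in gmm]
```

IN-T and IN-R then differ only in how they step through the curated list, not in how it is built.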
Mixing both in AGAIN.
Simply using one of the proposed IN algorithms directly for the second run lacks adaptive mechanisms towards the characteristics of the second agent, whose initial parameters and training data stream differ from the first's, which could lead to failure cases where the expert curriculum and the second learner are no longer "in phase". Additionally, if the initial run failed to discover progress niches, IN is bound to fail. We therefore propose to combine IN with an ALP-GMM teacher (with low task exploration) in the second run. The resulting Alp-Gmm And Inferred progress Niches approach, AGAIN for short, samples tasks from a GMM composed of the current mixtures of both ALP-GMM and IN. See figure 1 for a schematic pipeline and appendix B for details.
3 Experiments and Results
We test our variants and baselines on a parametric version of BipedalWalker proposed by Portelas et al. (2019), which generates walking tracks paved with stumps whose height and spacing are defined by a 2-D parameter vector used for the procedural generation of tasks. This continuous task space has boundaries set such that a substantial part of the space consists of tracks unfeasible for the default walker. As in their work, we also test our approaches with a modified short-legged walker, which constitutes an even more challenging scenario (as the task space is unchanged). All ACL variants are tested when paired with a Soft Actor-Critic (Haarnoja et al., 2018) policy. Performance is measured by tracking the percentage of mastered tasks from a fixed test set. See appendix C for details.
Is re-training from scratch beneficial?
The end performances of all tested conditions are summarized in table 1. Interestingly, in all tested variants, retraining the DRL agent from scratch in the second run gave superior end performance compared to fine-tuning from the weights of the first run. This showcases the brittleness of gradient-based training and the difficulty of transfer learning. Despite this, even fine-tuned variants reached higher end performance than classical ALP-GMM, meaning that the change in curriculum strategy is in itself already beneficial.
Is it useful to re-use ALP-GMM in the second run?
In the default walker experiments, the AGAIN-R, T, and P conditions, which mix ALP-GMM and IN in the second run, reached lower mean performances than their respective IN variants. However, the exact opposite is observed for the IN-R and IN-T variants in the short walker experiments. This can be explained by the difficulty of the short walker experiments for ACL approaches, which leads the preliminary 10M-step ALP-GMM runs to reach a substantially lower mean end performance than in the default walker experiments. These run failures led many of the GMM lists used in IN to be of very low quality, which illustrates the advantage of AGAIN: through ALP-GMM, it is able to emancipate itself from a low-quality IN curriculum.
Consistent with the preceding analysis, mixing ALP-GMM with IN in the second run is not essential in the default walker experiments, as the best performing ACL approach is IN-P. This most likely suggests that the improved adaptability of the curriculum when using AGAIN is outbalanced by the added noise (despite the low task exploration). However, in the more complex short walker experiments, mixing ALP-GMM with IN is essential, especially for AGAIN-R, which substantially outperforms ALP-GMM and all other AGAIN and IN variants. The difference in end performance between AGAIN-R and Oracle, our hand-made expert curriculum using privileged information, is not statistically significant.
Condition | Short walker | Default walker

Table 1: Experiments on Stump Tracks with short and default bipedal walkers. The average performance with standard deviation after 10 million steps (IN and AGAIN variants) or 20 million steps (others) is reported (30 seeds per condition). For IN and AGAIN we also test variants that do not retrain the weights of the policy used in the second run from scratch but rather fine-tune them from the preliminary run. Markers indicate whether the performance difference with ALP-GMM is statistically significant in a post-training Welch's t-test (separately for performance advantage and performance disadvantage w.r.t. ALP-GMM).
4 Conclusion and Discussion
In this work we presented Alp-Gmm And Inferred progress Niches (AGAIN), a simple yet effective approach that extracts prior knowledge over a space of tasks in order to design a curriculum tailored to a DRL agent. Instead of following the same exploratory ACL approach over the entire training, AGAIN performs a preliminary run with ALP-GMM, derives a list of progress niches from it, and uses this list to build an expert curriculum that is combined with a low task-exploration ALP-GMM teacher for a second run of the same DRL agent, trained from scratch.
Beyond tabula rasa?
In this work we showed that a non-tabula rasa curriculum generator that leverages prior knowledge over the task space (from a preliminary run) outperforms the regular approach that learns to generate an entire curriculum from scratch. However, we also demonstrated that, from the point of view of the DRL learner, it is actually better to restart tabula rasa (with a non-tabula rasa curriculum generator), which is a very interesting perspective and opens several lines of future work.
Beyond proposing a two-stage ACL technique for a single DRL agent, the experimental setup of this work can be seen as a particular case of a broader problem we propose to name Classroom Teaching (CT). CT defines a family of problems in which a meta-ACL algorithm is tasked to generate, either sequentially or simultaneously, multiple curricula tailored to each of its learning students, all of potentially varying abilities. CT differs from the problems studied in population-based developmental robotics (Forestier et al., 2017; Wang et al., 2019), as in CT the number and characteristics of learners are predefined, and the objective is to foster maximal learning progress over all learners rather than to iteratively construct high-performing policies. Studying CT scenarios brings DRL closer to human education research problems and might stimulate the design of methods that alleviate the expensive use of expert knowledge in current state-of-the-art assisted education (Clément et al., 2015; Koedinger et al., 2013).
- Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1), pp. 48–77.
- Baranes, A. and Oudeyer, P.-Y. (2009). R-IAC: robust intrinsically motivated exploration and active learning. IEEE Transactions on Autonomous Mental Development 1(3), pp. 155–169.
- Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), Montreal, Quebec, Canada, pp. 41–48.
- Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52(3), pp. 345–370.
- Clément, B., Roy, D., Oudeyer, P.-Y., and Lopes, M. (2015). Multi-armed bandits for intelligent tutoring systems. Journal of Educational Data Mining (JEDM) 7(2), pp. 20–48.
- Colas, C., Oudeyer, P.-Y., Sigaud, O., Fournier, P., and Chetouani, M. (2019). CURIOUS: intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 1331–1340.
- Czarnecki, W. M., Pascanu, R., Osindero, S., Jayakumar, S. M., Swirszcz, G., and Jaderberg, M. (2019). Distilling policy distillation. CoRR abs/1902.02186.
- Elman, J. L. (1993). Learning and development in neural networks: the importance of starting small. Cognition 48(1), pp. 71–99.
- Florensa, C., Held, D., Geng, X., and Abbeel, P. (2018). Automatic goal generation for reinforcement learning agents. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, pp. 1514–1523.
- Forestier, S., Mollard, Y., and Oudeyer, P.-Y. (2017). Intrinsically motivated goal exploration processes with automatic curriculum learning. CoRR abs/1708.02190.
- Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. (2018). Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, pp. 1602–1611.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290.
- Hacohen, G. and Weinshall, D. (2019). On the power of curriculum learning in training deep networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, Long Beach, California, USA, pp. 2535–2544.
- Jabri, A., Hsu, K., Gupta, A., Eysenbach, B., Levine, S., and Finn, C. (2019). Unsupervised curricula for visual meta-reinforcement learning. In Advances in Neural Information Processing Systems 32, pp. 10519–10530.
- Koedinger, K. R., Brunskill, E., Baker, R. S., McLaughlin, E. A., and Stamper, J. (2013). New potentials for data-driven intelligent tutoring system development and optimization. AI Magazine 34(3), pp. 27–41.
- Laversanne-Finot, A., Péré, A., and Oudeyer, P.-Y. (2018). Curiosity driven exploration of learned disentangled goal spaces. In Proceedings of The 2nd Conference on Robot Learning, PMLR 87, pp. 487–504.
- Moulin-Frier, C., Nguyen, S. M., and Oudeyer, P.-Y. (2014). Self-organization of early vocal development in infants and machines: the role of intrinsic motivation. Frontiers in Psychology (Cognitive Science) 4(1006).
- OpenAI et al. (2019). Solving Rubik's cube with a robot hand. arXiv abs/1910.07113.
- Oudeyer, P.-Y., Kaplan, F., and Hafner, V. V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation 11(2), pp. 265–286.
- Pong, V. H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S. (2019). Skew-Fit: state-covering self-supervised reinforcement learning. CoRR abs/1903.03698.
- Portelas, R., Colas, C., Hofmann, K., and Oudeyer, P.-Y. (2019). Teacher algorithms for curriculum learning of deep RL in continuously parameterized environments. In Proceedings of the Conference on Robot Learning (CoRL).
- Racanière, S., Lampinen, A. K., Santoro, A., Reichert, D. P., Firoiu, V., and Lillicrap, T. P. (2019). Automated curricula through setter-solver interactions. arXiv preprint arXiv:1909.12892.
- Schmidhuber, J. (1991). Curious model-building control systems. In Proc. International Joint Conference on Neural Networks, Singapore, pp. 1458–1463.
- Selfridge, O. G., Sutton, R. S., and Barto, A. G. (1985). Training and tracking in robotics. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, Los Angeles, CA, USA, pp. 670–672.
- Wang, R., Lehman, J., Clune, J., and Stanley, K. O. (2019). Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. CoRR abs/1901.01753.
- Yim, J., Joo, D., Bae, J., and Kim, J. (2017). A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7130–7138.
Appendix A ALP-GMM
ALP-GMM relies on an empirical per-task computation of Absolute Learning Progress (ALP), allowing it to fit a GMM on a concatenated space composed of task parameters and their respective ALP. Given a newly sampled task with parameter p, on which the policy collected the episodic reward r_new, its ALP is computed using the closest previously sampled task (Euclidean distance in parameter space) and its associated episodic reward r_old:

ALP_new = |r_new − r_old|.
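This per-task computation can be sketched directly; the history encoding below, a list of (parameter_tuple, episodic_reward) pairs, and all names are illustrative assumptions:

```python
import math

def alp(history, new_param, new_reward):
    """ALP of a new task: |r_new - r_old|, with r_old the episodic reward of
    the closest previously sampled task (Euclidean distance over parameters)."""
    if not history:
        return 0.0
    # nearest-neighbor lookup over previously encountered task parameters
    _, r_old = min(history, key=lambda pr: math.dist(pr[0], new_param))
    return abs(new_reward - r_old)
```

A task whose reward changed a lot relative to its closest neighbor thus gets a high ALP, regardless of whether performance improved or degraded.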
All previously encountered task parameters and their associated ALP (parameter-ALP for short), recorded in a history database H, are used for this computation. Contrastingly, the fitting of the GMM is performed periodically, on a window containing only the most recent parameter-ALP pairs. The resulting mean ALP dimension of each Gaussian of the GMM is used for proportional sampling. To adapt the number of components of the GMM online, a batch of GMMs having from 2 up to a maximal number of components is fitted on the window, and the best one according to Akaike's Information Criterion (Bozdogan, 1987) is kept as the new GMM. In our experiments we use the same hyperparameters as in Portelas et al. (2019), except for the percentage of random task sampling, for which we found a lower value to perform better when running ALP-GMM alone or when combined with IN in the second phase of AGAIN. See algorithm 1 for pseudo-code and figure 2 for a schematic pipeline. Note that in this paper we refer to ALP as LP for simplicity (i.e. LP in eq. 1 is equivalent to the mean ALP of the Gaussians in ALP-GMM).
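The AIC-based selection among candidate component counts reduces to a one-liner once the fitting itself is abstracted away. In this sketch, `candidates` is a hypothetical mapping from component count to (number of free parameters, log-likelihood); the actual GMM fitting is elided:

```python
def best_by_aic(candidates):
    """Return the component count minimizing AIC = 2k - 2*log_likelihood,
    where k is the number of free parameters of the fitted model."""
    return min(candidates, key=lambda n: 2 * candidates[n][0] - 2 * candidates[n][1])
```

The penalty term 2k discourages adding components that only marginally improve the fit.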
Appendix B AGAIN
In order to filter the list of GMMs collected after a preliminary run of ALP-GMM (see eq. 1) into a curated list usable as an expert curriculum, we remove any Gaussian whose LP is below a predefined threshold (the LP dimension is normalized between 0 and 1, which requires choosing an approximate potential reward range, set identically for all experiments). When all Gaussians of a GMM are discarded, the GMM is removed from the list. In practice, this removes the non-informative GMMs corresponding to the initial exploration phase of ALP-GMM, when the learner has not yet made any progress (hence no LP is detected by the curriculum generator). The curated list is then iterated over to generate a curriculum with either the Time-based (see algo. 2), Pool-based (see algo. 3), or Reward-based (see algo. 4) IN. The IN-P approach does not require additional hyperparameters. IN-T requires an update rate to step through the list, which we set to the fitting rate of ALP-GMM. The IN-R approach requires additional data to be extracted from the first run, in the form of a list of rewards:

R = [r_1, ..., r_T],   (2)

with T the total number of GMMs in the first run (the same as in eq. 1) and r_i the mean episodic reward obtained by the first DRL agent over the last tasks sampled from the i-th GMM. A curated reward list is obtained by removing any r_i that corresponds to a GMM discarded while curating the GMM list. The remaining rewards are then used as thresholds in IN-R to decide when to switch to the next GMM of the curated list.
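IN-R's stepping rule can be sketched as a small pure function; the list encoding and names below are illustrative, not taken from the original implementation:

```python
def inr_step(idx, recent_rewards, reward_thresholds):
    """Advance to the next GMM of the curated list once the mean episodic
    reward over recently sampled tasks reaches the reward recorded for the
    current GMM during the preliminary run; stay on the last GMM otherwise."""
    if idx < len(reward_thresholds) - 1:
        if sum(recent_rewards) / len(recent_rewards) >= reward_thresholds[idx]:
            return idx + 1
    return idx
```

This self-paces the curriculum: a slow second learner keeps training on the current niche until it matches the first agent's performance on it.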
In AGAIN (see algo. 5), the idea is to use both IN (R, T, or P) and ALP-GMM (without the random bootstrapping period) for curriculum generation. We combine the changing GMMs of IN and ALP-GMM over time simply by building a GMM containing the Gaussians of the current GMM of IN and those of the current GMM of ALP-GMM. The Gaussian from which to sample a new task is then selected within this combined mixture proportionally to its LP, which adaptively modulates the task sampling between both teachers: sampling shifts towards IN when ALP-GMM does not detect high-LP subspaces, and towards ALP-GMM when the current GMM of IN has low-LP Gaussians. Additionally, to retain minimal task exploration, which benefits ALP-GMM (allowing it to detect new progress niches), we sample random tasks with a small residual probability, lower than the one used in the preliminary ALP-GMM run.
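The per-task component choice of AGAIN can be sketched as follows, pooling the two current GMMs under the same illustrative dict encoding used above (names are ours):

```python
import random

def pick_component(in_gmm, alpgmm_gmm, rng=random):
    """Choose a Gaussian from the combined IN + ALP-GMM mixture proportionally
    to its LP, so sampling shifts toward whichever teacher currently detects
    learning progress."""
    pooled = in_gmm + alpgmm_gmm
    total = sum(g["lp"] for g in pooled) or 1.0
    r, acc = rng.uniform(0.0, total), 0.0
    for g in pooled:
        acc += g["lp"]
        if r <= acc:
            return g
    return pooled[-1]
```

If all IN components have zero LP, every draw falls on an ALP-GMM component, and vice versa, which is exactly the modulation described above.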
Appendix C Experimental details
In our experiments, we use an implementation of Soft Actor-Critic provided by OpenAI (https://github.com/openai/spinningup). We use the same fully-connected network architecture for V, Q1, Q2, and the policy. Gradient steps are performed at a fixed interval of environment steps, and the learning rate, batch size, and entropy coefficient are kept constant across all conditions.
Our proposed ACL variants choose the parameters of tasks that encode the procedural generation of walking tracks paved with stumps in the BipedalWalker environments. As in Portelas et al. (2019), we use the same bounds on the stump-height and stump-spacing dimensions (regardless of the walker morphology). The agent is rewarded for keeping its head straight and going forward, and is penalized for torque usage. The episode is terminated after 1) reaching the end of the track, 2) reaching a maximal number of steps, or 3) a head collision (for which the agent receives a strong penalty). See figure 3 for visualizations.
The Random curriculum baseline samples task parameters randomly over the parameter space. The Oracle condition is a hand-made curriculum very similar to IN-R, except that the list of Gaussians is built using expert knowledge, and all reward thresholds are set to a fixed episodic reward value often used in the literature as characterizing a default walker with a "reasonably efficient" walking gait (Wang et al., 2019). Basically, Oracle starts by proposing tasks from a Gaussian (with a small standard deviation) located at the simplest subspace of the task space (i.e. low stump height and high stump spacing), and then gradually moves the Gaussian towards the hardest subspaces (high stump height and low stump spacing) by small increments, each happening whenever the mean episodic reward of the DRL agent over the last proposed tasks exceeds the threshold. In our experiments, consistently with Portelas et al. (2019), who implement a similar approach, Oracle is prone to forgetting due to the strong shift in task subspace (which is why it is not the best performing condition in the default walker experiments; see table 1).
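The Oracle update just described can be sketched as a linear interpolation of the sampling Gaussian's mean between the two task-space corners; all constants in the sketch (number of increments, reward threshold, corner coordinates) are illustrative placeholders, not the values used in the experiments:

```python
def oracle_update(step_count, n_steps, recent_rewards, threshold, easy, hard):
    """Advance one increment toward the hard corner when the mean episodic
    reward over recently proposed tasks exceeds the mastery threshold, then
    return the Gaussian's new mean and the updated increment count."""
    if step_count < n_steps and sum(recent_rewards) / len(recent_rewards) > threshold:
        step_count += 1
    frac = step_count / n_steps
    mean = tuple(e + frac * (h - e) for e, h in zip(easy, hard))
    return mean, step_count
```

Unlike IN-R, the resulting curriculum never revisits easier subspaces, which is consistent with the forgetting observed for Oracle.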
To perform our experiments, we ran each condition for either 10 (IN and AGAIN variants) or 20 (others) million environment steps (30 repeats), using one CPU and one GPU per run (the GPU being shared between runs), for approximately 30 hours of wall-clock time per run. The preliminary ALP-GMM runs used in the IN and AGAIN variants correspond to the first 10 million steps of the ALP-GMM condition (whose end performance after 20 million steps is reported in table 1).
Appendix D Additional Visualizations
Figure captions (plots omitted). Box plots: gold lines are medians, surrounded by a box showing the first and third quartiles, followed by whiskers extending to the last datapoint within 1.5 times the inter-quartile range; datapoints beyond the whiskers are outliers. From top to bottom, each box plot presents results when using Pool-based, Reward-based, or Time-based IN, respectively. Learning curves show the percentage of mastered tasks on a fixed test set; shaded areas represent the standard error of the mean.