
Automated curriculum generation for Policy Gradients from Demonstrations

by   Anirudh Srinivasan, et al.

In this paper, we present a technique that improves the process of training an agent (using RL) for instruction following. We develop a training curriculum that uses a nominal number of expert demonstrations and trains the agent in a manner that draws parallels from one of the ways in which humans learn to perform complex tasks, i.e by starting from the goal and working backwards. We test our method on the BabyAI platform and show an improvement in sample efficiency for some of its tasks compared to a PPO (proximal policy optimization) baseline.





1 Introduction

Training an agent to understand and follow human language instructions to complete a task is an active area of research. The end goal is an agent that can take in a task described in human language and execute it in the environment. Several such environments have recently been developed that are easy for an end user to work with and experiment on (Hermann et al. [2017], Yu et al. [2018] and Chevalier-Boisvert et al. [2019]).

One such platform is BabyAI (Chevalier-Boisvert et al. [2019]), which is presented as a tool to study the sample efficiency of different algorithms for this task. It comprises multiple levels, each consisting of a gridworld environment and an agent that is given a task in natural language to accomplish. Each level combines different challenges (navigation, mazes, distractors, door unlocking, sequencing of tasks, etc.) that the neural agent has to learn.

Research in this area has moved beyond simply training an agent to complete a task; much of the current focus is on visualizing and understanding how the agent learns. One direction being explored is training the agent in a way that mirrors how humans learn. The best-known example is curriculum learning, i.e., training a model on simpler tasks first so that it can be trained on a complex task more easily; Bengio et al. [2009] and many others have shown that this works for neural networks. Another strategy humans use to solve complex tasks is to start from the goal and work backwards. Florensa et al. [2017] propose a technique in which robots are trained on a curriculum based on this idea.

This paper proposes an algorithm inspired by this way in which humans learn complex tasks. Our method comprises a curriculum that first trains the agent on tasks right next to the goal and then introduces increasingly difficult tasks. To obtain these tasks, we use information from expert demonstrations. Our method shows improvements on BabyAI levels whose demonstrations are not extremely long and that require a fair amount of exploration by the agent to solve.

2 Related Work

One category of methods for instruction following is imitation learning, where a neural network maps each input to the action to execute. This category ranges from simple methods like behavioral cloning to more complex ones like DAgger (Ross et al. [2010]) that involve a human in the loop. Another family of methods is reinforcement-learning based, where the agent has to learn a policy (Sutton et al. [2000]) or a Q-function (Watkins and Dayan [1992]). A final class of methods does not directly model the policy; maximum entropy inverse reinforcement learning (Ziebart et al. [2008]) and GAIL (Ho and Ermon [2016]) fall into this class.

Bengio et al. [2009] proposed curriculum learning and showed that pre-training an LSTM (for language modelling) on easier tasks allowed training on more complex tasks to proceed much faster. There have been multiple attempts at applying curriculum learning to reinforcement learning tasks. Most of these methods are concerned with determining the order in which to present the tasks of the curriculum. Methods have been proposed where descriptions of the tasks are used to construct a directed acyclic graph of tasks (Svetlik et al. [2017]) or an object-oriented representation of the tasks (Silva and Costa [2018]), from which the curriculum is obtained. Matiisen et al. [2017] (Teacher-Student Curriculum Learning), Graves et al. [2017] and Narvekar et al. [2017] (Curriculum Policies) all propose methods where selecting the next task to train on is itself viewed as an RL problem. Most of these methods rely on a curriculum already having been developed by the user. Florensa et al. [2017] propose a method where the curriculum is formulated by exploring the states of the environment and ordering them.

Some works have also proposed using expert demonstrations during the reinforcement learning training process. Nair et al. [2018] used them to speed up training on robotics tasks with sparse reward functions. They store the demonstrations in an additional replay buffer from which they sample during each minibatch, and they also reset the agent to states from the demonstrations. Both measures speed up the initial phase of training, where the agent frequently receives zero reward. Hester et al. [2017] also use demonstrations via a replay buffer that they sample from. Finally, Resnick et al. [2018] propose a technique called Backplay, which uses a single demonstration to create a curriculum for the agent.

3 Reverse Curriculum Learning

3.1 Existing Work

Florensa et al. [2017] propose Reverse Curriculum Generation. In this method, the agent starts off from a state right next to the goal state and begins a random walk. The states that are one step away form the first stage of the curriculum, those two steps away form the second stage, and so on. In this way, a curriculum is built without any intervention from the user. Their algorithm is described in Algorithm 1.

Data: s_g: final (goal) state of the agent, n_stages: number of curriculum stages wanted
Result: stages: list of stages, with each stage as a list of start states
stages ← [[ ] repeated n_stages times]; states ← [s_g repeated n_states times]
for i = 1 to n_stages do
       new_states ← [ ]
       for state in states do
             Sample a random action a; state ← env.step(a); new_states.append(state)
       stages[i] ← new_states; states ← new_states
Algorithm 1 Reverse Curriculum Learning
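As an illustration, the random-walk expansion above can be sketched in Python; the `step_fn` hook and the toy one-dimensional grid below are hypothetical stand-ins for the environment's transition dynamics, not part of the original work:

```python
import random

def reverse_curriculum(goal_state, n_stages, n_states_per_stage, step_fn):
    """Sketch of reverse curriculum generation: random-walk outward from
    the goal; states reached after i random steps form stage i."""
    stages = []
    frontier = [goal_state] * n_states_per_stage
    for _ in range(n_stages):
        # One more random step moves every frontier state further from the goal.
        frontier = [step_fn(s) for s in frontier]
        stages.append(list(frontier))
    return stages

# Toy 1-D grid: a random step moves the agent one cell left or right.
def random_step(pos):
    return pos + random.choice([-1, 1])

curriculum = reverse_curriculum(goal_state=0, n_stages=3,
                                n_states_per_stage=5, step_fn=random_step)
```

A state in stage i of this sketch is at most i + 1 cells from the goal, matching the intent that later stages start further away.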

3.2 Applying existing methods to our environment

In the work of Florensa et al. [2017], the above-mentioned technique achieved good results on robotic-arm movement tasks. Those tasks differ from our case in that they have continuous state and action spaces, whereas BabyAI's gridworld environment has discrete ones. We observed that if an agent starts a random walk from near the goal, it ends up at the goal state quite frequently (see Table 1), rendering the generated path useless. This exploration technique is therefore not the right one for a discrete environment.

Env/Steps 1 2 3 4 5
GoToLocal 34.4 10.2 4.5 3.6 3.1
PutNextLocal 26.5 6.3 1.8 0.01 0.01
Table 1: % of times the agent reaches the goal when it starts 1, 2, 3, 4 or 5 steps away and performs a random walk

3.3 Using demos to generate a curriculum

To alleviate this, we devise a method that uses the demonstrations generated by BabyAI's heuristic expert. Each demonstration comprises a sequence of actions that takes the agent from a start state to the end/goal state. Traversing each demo, we collect the states that are one step/action away from the goal state and make them the first stage of the curriculum. We then collect the states that are two steps away from the goal and make them the second stage. This process is repeated to obtain the entire curriculum. The algorithm is detailed in Algorithm 2, and Figure 1 gives a graphical depiction of how the curriculum is generated.

Figure 1: Generation of curriculum stages from demos. Triangles are start states, Squares are goal states and Circles are intermediate states
Data: demos: array of demos generated by the bot
Result: stages: list of stages, with each stage as a list of start states
n_stages ← max([len(demo) for demo in demos]) - 1
stages ← [[ ] repeated n_stages times]
for demo in demos do
       n_steps ← len(demo) - 1
       for i = 1 to n_steps do
             Initialize env with the seed used for demo
             Step env through the first (n_steps - i) actions of demo
             Add a copy of env's state to stages[i]
Algorithm 2 Generating curriculum from demos
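A minimal Python sketch of this construction, under the simplifying assumption that each demo is available as its full state sequence (the algorithm above instead re-simulates states from the environment seed):

```python
def curriculum_from_demos(demos):
    """Build curriculum stages from demonstrations: stage i collects,
    from every demo long enough, the state i+1 steps before the goal.

    Each demo is assumed to be a list of states ending at the goal."""
    n_stages = max(len(d) for d in demos) - 1
    stages = [[] for _ in range(n_stages)]
    for demo in demos:
        for i in range(1, len(demo)):
            # demo[-1 - i] is the state i steps before the final (goal) state.
            stages[i - 1].append(demo[-1 - i])
    return stages

demos = [["s0", "s1", "s2", "goal"], ["t0", "t1", "goal"]]
stages = curriculum_from_demos(demos)
# stages[0] holds the states one step before each demo's goal: ["s2", "t1"]
```

Shorter demos simply stop contributing to later stages, so early stages are the most densely populated.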

The curriculum is defined as an ordered set of stages, with each stage being a set of start states as obtained by Algorithm 2. The number of stages in the curriculum (i.e., the curriculum length) is determined by the maximum length of the given demos. This technique can be applied to any problem being solved by RL where a small number of demonstrations are available. The ordering of tasks mirrors how humans learn to perform complex tasks: starting from something close to the goal and working backwards.

We call our modified algorithm RCPPO, as it builds on PPO (Schulman et al. [2017]). PPO is modified so that each time the agent resets after the end of an episode, it may only reset to the set of states allowed by the curriculum stage it is in. This is very similar to Nair et al. [2018], where the agent may choose to reset to any random state or to a state from the demonstrations. Once all the stages in the curriculum have been completed, the agent is allowed to reset to any state in the environment.
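A minimal sketch of this modified reset rule, with hypothetical names (`CurriculumResetter`, `default_reset`) standing in for the actual PPO integration:

```python
import random

class CurriculumResetter:
    """Sketch of RCPPO's reset: while the curriculum is active, episodes
    start only from states of the current stage; once every stage is
    exhausted, ordinary environment resets resume."""

    def __init__(self, stages):
        self.stages = stages      # list of stages, each a list of start states
        self.stage_idx = 0

    def reset_state(self, default_reset):
        if self.stage_idx < len(self.stages):
            # Pick uniformly among the current stage's allowed start states.
            return random.choice(self.stages[self.stage_idx])
        return default_reset()    # curriculum finished: reset anywhere

    def advance(self):
        """Called when the stage's success criterion is met."""
        self.stage_idx += 1
```

In training, `reset_state` would replace the environment's unconstrained reset until the curriculum completes.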

4 Experiments

We compared our algorithm against an implementation of PPO, which is built on policy gradients (Sutton et al. [2000]). We tested our algorithm on BabyAI as it has a large number of levels, each composed of different challenges that the network has to learn. For each level, we report the number of frames the agent needs to see before reaching a particular accuracy (0.95/0.99) on that level. For methods that use the curriculum, we check accuracy only after the curriculum has been completed (i.e., on the task where the agent may reset to any state on the grid). Success in BabyAI is defined as the agent getting a reward greater than 0, which happens as long as it reaches the goal state.

We ran 2 versions of RCPPO, differing in the criterion used to determine when to move to the next stage of the curriculum. The first variant makes a curriculum update when the success rate for the current stage hits 90%. The second variant sets this threshold to 70% and gradually increases it to 99% as training progresses through the stages. We report the mean of the two variants in the results. We evaluated other approaches for this, such as Matiisen et al. [2017] (see Teacher-Student Curriculum Learning in the Appendix), but determined that a simple threshold was enough.
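As an illustrative sketch, and assuming (hypothetically, since the exact schedule is an implementation detail) a linear ramp between the two endpoints, the second variant's per-stage thresholds could be computed as:

```python
def stage_thresholds(n_stages, start=0.70, end=0.99):
    """Advancement thresholds for a curriculum of n_stages stages:
    the success rate needed to leave a stage grows from 70% for the
    first stage to 99% for the last. (The first RCPPO variant instead
    uses a flat 90% threshold for every stage.)"""
    if n_stages == 1:
        return [end]
    step = (end - start) / (n_stages - 1)
    return [start + i * step for i in range(n_stages)]

thresholds = stage_thresholds(4)  # ≈ [0.70, 0.797, 0.893, 0.99]
```

Raising the bar gradually lets the agent leave early (easy, near-goal) stages quickly while demanding near-perfect performance before the curriculum ends.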

In our experiments, we used 1000 demonstrations from the expert for all the levels. We also performed a study on how many demonstrations are needed to do well enough (see Effect of number of demos used to build curriculum in Appendix), however that is not the main focus of our work.

Code containing an implementation of the above technique is available online.

Small Levels (For 0.99 Accuracy)
Level PPO RCPPO
GoToRedBall 34816 41728
PickupLoc 95488 205056
GoToObjMaze 134144 176128
GoToLocal 138240 140288
PickupLocal 177920 205056
PutNextLocal 804096 580608
UnlockPickup - 15360
BlockedUnlockPickup - 26112
UnlockPickupDist - 366080
Large Levels (For 0.95 Accuracy)
Open 72960 101632
GoTo 577536 787712
Pickup 762368 -
Table 2: # Frames (in 100s) to reach the given accuracy. "-" indicates the level was not solvable

5 Results and Analysis

We divided the levels into 2 categories, small and large, based on the number of rooms in the environment, and report our findings for each separately.

5.1 RCPPO on small levels

Figure 2: RCPPO on small levels

Small levels are levels with 1 room (GoToLocal, GoToRedBall, etc.) or 2 rooms (UnlockPickup, BlockedUnlockPickup). For these levels, we report the number of frames to reach 0.99 accuracy, as that is easily achievable by our baseline.

  1. Levels like GoToRedBall, GoToLocal, etc. show no improvement from our technique.

  2. Levels like PutNextLocal, UnlockPickup, etc. show significant improvements with our method. In fact, levels like UnlockPickup, UnlockPickupDist and BlockedUnlockPickup are unsolvable by vanilla PPO, but our method solves them easily.

Levels like GoToRedBall and GoToLocal are extremely simple: they require the agent to perform only 2 or 3 (movement-related) of its 7 possible actions. We theorize that a neural network learns this easily, and hence our method is ineffective there.

On the more complex levels, the agent has to execute more of the actions from its space of possible actions to solve each level, and the mean demo length for these levels is also larger (refer to Table 3 in the Appendix). Our method additionally mitigates the problem of many zero-reward episodes in the initial training stages (as described in Nair et al. [2018]). Our curriculum helps the agent handle all of these challenges and speeds up its training.

5.2 RCPPO on large levels

(a) GoTo
(b) Open
Figure 3: RCPPO on large levels, with variations

Large levels are levels with more than 2 rooms. These levels are characterized by extremely long demos produced by the bot (Table 3). For these levels, we report the number of frames to reach 0.95 accuracy, as reaching 0.99 was difficult even for the baseline method. For our analysis, we look at the performance on GoTo, as the performance on the other levels was similar.

In the case of GoTo (RCPPO_1 in Figure 3(a)), the curriculum ends at an extremely late stage, due to which the technique does not result in an improvement. This is very much akin to a remark in Bengio et al. [2009] that using curriculum learning can result in the model needing to see more examples during training.

To combat this problem, we implemented a method that combines consecutive stages into one, reducing the curriculum length. We tried several groupings: 5 stages into one, 10 stages into one, and an exponential scheme (1, 2, 4 and so on). As is evident from Figure 3(a), this also did not yield much improvement. We conclude that even after combining stages, the tasks within each stage are too varied and the agent isn't able to learn effectively.
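The stage-combining idea can be sketched as follows; `merge_stages` and its `group_sizes` argument are hypothetical names used only for illustration:

```python
def merge_stages(stages, group_sizes):
    """Pool consecutive curriculum stages to shorten a long curriculum.
    group_sizes gives how many original stages each merged stage absorbs,
    e.g. [5, 5, ...] for fixed grouping or [1, 2, 4, ...] for the
    exponential scheme."""
    merged, i = [], 0
    for size in group_sizes:
        group = stages[i:i + size]
        if not group:
            break  # ran out of original stages
        # A merged stage is the union of its constituent stages' start states.
        merged.append([s for stage in group for s in stage])
        i += size
    return merged

stages = [[f"s{i}"] for i in range(7)]  # 7 one-state stages
merged = merge_stages(stages, [1, 2, 4])  # exponential grouping
# merged[2] pools the start states of original stages 3..6
```

Each merged stage then mixes start states at several distances from the goal, which is the variety the analysis above blames for the lack of improvement.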

Similar performance is observed for other large levels. We conclude that our method is not effective when the mean demo length is very large.

6 Conclusions

We present our algorithm RCPPO, which takes some demonstrations from an expert and uses them to build a curriculum for training an agent for instruction following. We observe improvements on levels where the agent has to execute a wide variety of actions to reach the goal, as long as the number of actions needed to reach the goal is not too large.

Our work builds on top of the work by Florensa et al. [2017]. Their method needed some tweaking to work with the action and state spaces present in our case: rather than using a random walk to expand the state space for further stages of the curriculum, we use information from the demos and constrain the expansion to the states within them. We end up with a fixed-length curriculum for our task.

In our work, we were not able to evaluate our technique on the most difficult levels in BabyAI. Although our experiments with GoTo suggest it is unlikely to work on more complex levels, a more refined version of our method could be tried on larger levels like SeqToSeq and PutNext. One could look at a particular start state in the curriculum and use the series of observations obtained while traversing the demonstration to reach that start state, incorporating this information via a recurrence, similar to how memory is incorporated into policy gradients by Wierstra et al. [2007]. This information may help in solving levels that have long demonstrations.

Although a short study is presented in the Appendix (Effect of number of demos used to build curriculum), we have not done an extensive analysis of how the number of demos used affects learning. This is also something that could be looked into.


Acknowledgements

We would like to thank Léonard Boussioux and David Yu-Tung Hui for their inputs and motivation during the experimentation for this work. We would like to thank Sebastin Santy for his feedback on the draft of the paper. We would also like to thank Compute Canada for providing the GPU resources used to run the experiments for this work.


References

  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, New York, NY, USA, pp. 41–48. External Links: ISBN 978-1-60558-516-1, Link, Document Cited by: §1, §2, §5.2.
  • M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio (2019) BabyAI: first steps towards grounded language learning with a human in the loop. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1.
  • C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel (2017) Reverse curriculum generation for reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning, S. Levine, V. Vanhoucke, and K. Goldberg (Eds.), Proceedings of Machine Learning Research, Vol. 78, , pp. 482–495. External Links: Link Cited by: §1, §2, §3.1, §3.2, §6.
  • A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1311–1320. Cited by: §2.
  • K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, et al. (2017) Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551. Cited by: §1.
  • T. Hester, M. Vecerík, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys (2017) Learning from demonstrations for real world reinforcement learning. CoRR abs/1704.03732. External Links: Link, 1704.03732 Cited by: §2.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: §2.
  • T. Matiisen, A. Oliver, T. Cohen, and J. Schulman (2017) Teacher-student curriculum learning. arXiv preprint arXiv:1707.00183. Cited by: §2, §4, Teacher-Student Curriculum Learning.
  • A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 6292–6299. External Links: Document, ISSN 2577-087X Cited by: §2, §3.3, §5.1.
  • S. Narvekar, J. Sinapov, and P. Stone (2017) Autonomous task sequencing for customized curriculum design in reinforcement learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2536–2542. Cited by: §2.
  • C. Resnick, R. Raileanu, S. Kapoor, A. Peysakhovich, K. Cho, and J. Bruna (2018) Backplay: "Man muss immer umkehren". arXiv preprint arXiv:1807.06919. Cited by: §2.
  • S. Ross, G. J. Gordon, and J. A. Bagnell (2010) A reduction of imitation learning and structured prediction to no-regret online learning. arXiv preprint arXiv:1011.0686. Cited by: §2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3.3.
  • F. L. D. Silva and A. H. R. Costa (2018) Object-oriented curriculum generation for reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1026–1034. Cited by: §2.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2, §4.
  • M. Svetlik, M. Leonetti, J. Sinapov, R. Shah, N. Walker, and P. Stone (2017) Automatic curriculum graph generation for reinforcement learning agents. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §2.
  • D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber (2007) Solving deep memory pomdps with recurrent policy gradients. In International Conference on Artificial Neural Networks, pp. 697–706. Cited by: §6.
  • H. Yu, H. Zhang, and W. Xu (2018) Interactive grounded language acquisition and generalization in a 2d world. arXiv preprint arXiv:1802.01433. Cited by: §1.
  • B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. Cited by: §2.


Statistics on demonstrations

We computed the mean and max length of demos for different levels. The max length of the demos corresponds to the length of the curriculum for that level.

Level Avg Max
GoToRedBall 6.2 21
GoToLocal 6.4 23
GoToRedBallGrey 6.8 24
PickupLocal 7.0 23
PutNextLocal 13.0 56
UnlockLocal 15.1 28
UnlockPickup 20.4 31
UnlockPickupDist 26.7 68
Open 30.1 186
BlockedUnlockPickup 35.3 47
GoTo 52.9 208
Pickup 53.9 209
GoToSeq 72.2 265
PutNext 90.2 237
Table 3: Avg and Max length of demos. Determines curriculum lengths

Effect of number of demos used to build curriculum

We perform a study of how the agent's training performance changes when different numbers of demos are used to build the curriculum. To this end, we train on PutNextLocal with 100, 500, 1000 and 2000 demos. We report the same metric as before, the number of frames to reach a particular success level, this time including the statistic for success rates of 0.9 and 0.95 as well. As is evident, the number of demos used does not have a significant impact on training, with all results being close to each other.

#Demos 0.9 0.95 0.99
100 300800 348672 584448
500 348160 502016 655616
1000 288256 371712 580608
2000 374016 479232 611840
Table 4: #Frames to reach a particular success level
Figure 4: Training curves using different #demos

Teacher-Student Curriculum Learning

We used teacher-student curriculum learning (Matiisen et al. [2017]) to determine when, and to which stage, to change the curriculum. We evaluated this on GoTo, the level on which our algorithm struggled, and did not see any improvement. The training curves in Figure 5 depict 2 different versions of teacher-student curriculum learning against PPO.

Figure 5: Teacher Student curriculum learning on RCPPO vs vanilla PPO