Teacher-Student Curriculum Learning

07/01/2017 ∙ by Tambet Matiisen, et al.

We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train on. We describe a family of Teacher algorithms that rely on the intuition that the Student should practice more on those tasks on which it makes the fastest progress, i.e. where the slope of the learning curve is highest. In addition, the Teacher algorithms address the problem of forgetting by also choosing tasks where the Student's performance is getting worse. We demonstrate that TSCL matches or surpasses the results of carefully hand-crafted curricula in two tasks: addition of decimal numbers with an LSTM and navigation in Minecraft. Using our automatically generated curriculum, the Student solved a Minecraft maze that could not be solved at all when training directly on the maze, and learning was an order of magnitude faster than with uniform sampling of subtasks.


1 Introduction

Deep reinforcement learning algorithms have been used to solve difficult tasks in video games (Mnih et al., 2015), locomotion (Schulman et al., 2015; Lillicrap et al., 2015) and robotics (Levine et al., 2015). But tasks with sparse rewards like “Robot, fetch me a beer” remain challenging to solve with direct application of these algorithms. One reason is that the number of samples needed to solve a task with random exploration increases exponentially with the number of steps to get a reward (Langford, 2011). One approach to overcome this problem is to use curriculum learning (Bengio et al., 2009; Zaremba and Sutskever, 2014; Graves et al., 2016; Wu and Tian, 2017), where tasks are ordered by increasing difficulty and training only proceeds to harder tasks once easier ones are mastered. Curriculum learning helps when, after mastering a simpler task, the policy for a harder task is discoverable through random exploration.

To use curriculum learning, the researcher must:

  • Be able to order subtasks by difficulty.

  • Decide on a mastery threshold. This can be based on achieving a certain score (Zaremba and Sutskever, 2014; Wu and Tian, 2017), which requires prior knowledge of the acceptable performance on each task. Alternatively it can be based on a plateau in performance, which can be hard to detect given the noise in the learning curve.

  • Continuously mix in easier tasks while learning harder ones to avoid forgetting. Designing these mixtures effectively is challenging (Zaremba and Sutskever, 2014).

In this paper, we describe a new approach called Teacher-Student Curriculum Learning (TSCL). The Student is the model being trained. The Teacher monitors the Student’s training progress and determines the tasks on which the Student should train at each training step, in order to maximize the Student’s progression through the curriculum. The Student can be any machine learning model. The Teacher is itself learning about the Student as it’s giving tasks, all as part of a single training session.

We describe several Teacher algorithms based on the notion of learning progress (Oudeyer and Kaplan, 2007). The main idea is that the Student should practice more on the tasks on which it is making the fastest progress, i.e. where the learning curve slope is highest. To counter forgetting, the Student should also practice tasks where the performance is getting worse, i.e. where the learning curve slope is negative.

The main contributions of the paper are:

  • We formalize TSCL, a Teacher-Student framework for curriculum learning, as a partially observable Markov decision process (POMDP).

  • We propose a family of algorithms based on the notion of learning progress. The algorithms also address the problem of forgetting previous tasks.

  • We evaluate the algorithms on two tasks, one supervised and one reinforcement learning: addition of decimal numbers with an LSTM and navigation in Minecraft.

2 Teacher-Student Setup

Figure 1: The Teacher-Student setup

Figure 1 illustrates the Teacher-Student interaction. At each timestep, the Teacher chooses tasks for the Student to practice on. The Student trains on those tasks and returns a score. The Teacher’s goal is for the Student to succeed on a final task with as few training steps as possible. Usually the task is parameterized by a categorical value representing one of $N$ subtasks, but one can also imagine multi-dimensional or continuous task parameterizations. The score can be the episode total reward in reinforcement learning or the validation set accuracy in supervised learning.

We formalize the Teacher’s goal of helping the Student to learn a final task as solving a partially observable Markov decision process (POMDP). We present two POMDP formulations: (1) Simple, best suited for reinforcement learning; and (2) Batch, best suited for supervised learning.

2.1 Simple POMDP Formulation

The simple POMDP formulation exposes the score of the Student on a single task and is well-suited for reinforcement learning problems.

  • The state $s_t$ represents the entire state of the Student (i.e. neural network parameters and optimizer state) and is not observable to the Teacher.

  • The action $a_t$ corresponds to the parameters of the task chosen by the Teacher. In the following we only consider a discrete task parameterization. Taking an action means training the Student on that task for a certain number of iterations.

  • The observation $o_t$ is the score $x_t^{(i)}$ of the task $i = a_t$ the Student trained on at timestep $t$, i.e. the episode total reward. While in theory the Teacher could also observe other aspects of the Student state like network weights, for simplicity we choose to expose only the score.

  • The reward $r_t$ is the change in score for the task the Student trained on at timestep $t$: $r_t = x_t^{(i)} - x_{t'_i}^{(i)}$, where $t'_i$ is the previous timestep when the same task was trained on.
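The reward in the simple formulation can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the class name `ScoreChangeReward` and the `initial_score` argument (the score attributed to a task before it has ever been trained on) are assumptions introduced here.

```python
class ScoreChangeReward:
    """Reward = current score minus the score seen the last time
    the same task was trained on (simple formulation)."""

    def __init__(self, num_tasks, initial_score=0.0):
        # Assumption: tasks that were never trained on start at initial_score.
        self.last_score = [initial_score] * num_tasks

    def __call__(self, task, score):
        reward = score - self.last_score[task]
        self.last_score[task] = score  # remember for the next visit
        return reward


reward_fn = ScoreChangeReward(num_tasks=3)
rewards = [
    reward_fn(0, 0.5),   # first visit of task 0
    reward_fn(1, 0.25),  # first visit of task 1
    reward_fn(0, 0.75),  # task 0 improved
    reward_fn(0, 0.5),   # task 0 got worse: negative reward
]
```

A negative reward here signals that the score on the task dropped since its previous visit, which is the forgetting signal the Teacher algorithms react to.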

2.2 Batch POMDP Formulation

In supervised learning a training batch can include multiple tasks. Therefore action, observation, and reward apply to the whole training set and scores can be measured on a held-out validation set. This motivates the batch formulation of the POMDP:

  • The state $s_t$ represents the training state of the Student.

  • The action $a_t = (p_t^{(1)}, \dots, p_t^{(N)})$ represents a probability distribution over $N$ tasks. Each training batch is sampled according to the distribution $a_t$, where $p_t^{(i)}$ is the probability of task $i$ at timestep $t$.

  • The observation $o_t = (x_t^{(1)}, \dots, x_t^{(N)})$ is the scores of all tasks after the training step. In the simplest case the scores could be the accuracies of the tasks in the training set. But in the case of minibatch training the model evolves during training, and therefore an additional evaluation pass is needed anyway to produce consistent results. Therefore we use a separate validation set that contains a uniform mix of all tasks for this evaluation pass.

  • The reward $r_t$ is the sum of changes in evaluation scores from the previous timestep: $r_t = \sum_{i=1}^{N} \left( x_t^{(i)} - x_{t-1}^{(i)} \right)$.

This setup could also be used with reinforcement learning by performing training in batches of episodes. But because scoring one sample (one episode) in reinforcement learning is usually much more computationally expensive than in supervised learning, it makes sense to use the simple POMDP formulation and decide on the next task after each training step.
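As a rough illustration of the batch formulation, the sketch below draws the task of every sample in a batch from a task distribution and computes the reward as the summed change in per-task validation scores. The function names and the example numbers are hypothetical; only NumPy is assumed.

```python
import numpy as np


def sample_batch_tasks(task_probs, batch_size, rng):
    """Draw the task index of every sample in one training batch."""
    return rng.choice(len(task_probs), size=batch_size, p=task_probs)


def batch_reward(scores, prev_scores):
    """Sum over tasks of the change in validation score since the last step."""
    return float(np.sum(np.asarray(scores) - np.asarray(prev_scores)))


rng = np.random.default_rng(0)
task_probs = np.array([0.5, 0.25, 0.25])   # the Teacher's action
tasks = sample_batch_tasks(task_probs, batch_size=8, rng=rng)

# Hypothetical validation scores before and after one training step.
reward = batch_reward(scores=[0.75, 0.5, 0.5], prev_scores=[0.5, 0.5, 0.25])
```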

2.3 Optimization Criteria

For either of the POMDP formulations, maximizing the Teacher episode total reward is equivalent to maximizing the score of all tasks at the end of the episode: $\sum_{t=1}^{T} r_t = \sum_{i=1}^{N} x_{T_i}^{(i)}$, where $T_i$ is the last training step where task $i$ was being trained on (the telescoping summation cancels out all terms but the $T_i$-th).
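The equivalence can be checked for the simple formulation with a short derivation. The notation is an assumption introduced here for illustration: $N$ is the number of tasks, $T$ the number of Teacher timesteps, $x^{(i)}_{u}$ the score of task $i$ after timestep $u$, $u_1 < \dots < u_{m_i}$ the timesteps at which task $i$ was trained, and $x^{(i)}_{u_0}$ its score before any training.

```latex
\begin{align*}
\sum_{k=1}^{m_i} \left( x^{(i)}_{u_k} - x^{(i)}_{u_{k-1}} \right)
  &= x^{(i)}_{u_{m_i}} - x^{(i)}_{u_0}
  && \text{(intermediate terms cancel)} \\
\sum_{t=1}^{T} r_t
  &= \sum_{i=1}^{N} \left( x^{(i)}_{u_{m_i}} - x^{(i)}_{u_0} \right)
  && \text{(group rewards by task)}
\end{align*}
```

The initial scores do not depend on the Teacher's choices, so maximizing the total reward amounts to maximizing the sum of the final scores of all tasks.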

While an obvious choice for optimization criteria would have been the performance in the final task, initially the Student might not have any success in the final task and this does not provide any meaningful feedback signal to the Teacher. Therefore we choose to maximize the sum of performances in all tasks. The assumption here is that in curriculum learning the final task includes the elements of all previous tasks, therefore good performance in the intermediate tasks usually leads to good performance in the final task.

3 Algorithms

POMDPs are typically solved using reinforcement learning algorithms. But those require many training episodes, while we aim to train the Student in one Teacher episode. Therefore, we resort to simpler heuristics. The basic intuition is that the Student should practice more on those tasks on which it is making the most progress (Oudeyer and Kaplan, 2007), while also practicing tasks that are at risk of being forgotten.

Figure 2: Idealistic curriculum learning. Left: Scores of different tasks improve over time, the next task starts improving once the previous task has been mastered. Right: Probability of sampling a task depends on the slope of the learning curve.

Figure 2 is a demonstration of the ideal training progress in a curriculum learning setting:

  1. At first, the Teacher has no knowledge so it samples from all tasks uniformly.

  2. When the Student starts making progress on task 1, the Teacher allocates more probability mass to this task.

  3. When the Student masters task 1, its learning curve flattens and the Teacher samples the task less often. At this point Student also starts making progress on task 2, so the Teacher samples more from task 2.

  4. This continues until the Student masters all tasks. As all task learning curves flatten in the end, the Teacher returns to uniform sampling of the tasks.

The picture above is idealistic, since in practice some unlearning often occurs, i.e. when most of the probability mass is allocated to task 2, performance on task 1 might get worse. To counter this the Student should also practice all learned tasks, especially those where unlearning occurs. For this reason we instead sample tasks according to the absolute value of the slope of the learning curve. If the change in scores is negative, unlearning must have occurred and the task should be practiced more.

This description alone does not prescribe an algorithm. We need to propose a method of estimating learning progress from noisy task scores, and a way to balance exploration and exploitation. We take inspiration from algorithms for the non-stationary multi-armed bandit problem (Sutton and Barto, 1998) and adapt them to TSCL. For brevity we only give the intuition for the simple formulation algorithms here; the formal descriptions can be found in Appendices A and B.

3.1 Online algorithm

The Online algorithm is inspired by the basic non-stationary bandit algorithm (Sutton and Barto, 1998). It uses an exponentially weighted moving average to track the expected return $Q$ from different tasks:

$$Q_{t+1}(a_t) = \alpha r_t + (1 - \alpha) Q_t(a_t),$$

where $\alpha$ is the learning rate. The next task can be chosen by $\epsilon$-greedy exploration: sample a random task with probability $\epsilon$, or $\arg\max_a Q_t(a)$ otherwise.

Alternatively the next task can be chosen using the Boltzmann distribution:

$$p(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{i=1}^{N} e^{Q_t(i)/\tau}},$$

where $\tau$ is the temperature of the Boltzmann distribution. For details, see Algorithm 1 in Appendix A.
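The moving-average update and the two exploration policies described above might look as follows in NumPy. This is a minimal sketch under stated assumptions: the function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that is not mentioned in the text.

```python
import numpy as np


def update_expected_return(q, task, reward, alpha):
    """Exponentially weighted moving average update for the trained task."""
    q = q.copy()
    q[task] = alpha * reward + (1.0 - alpha) * q[task]
    return q


def epsilon_greedy(q, epsilon, rng):
    """With probability epsilon pick a random task, otherwise the best one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))


def boltzmann_probs(q, temperature):
    """Softmax over expected returns; subtracting the max avoids overflow."""
    weights = np.exp((q - np.max(q)) / temperature)
    return weights / weights.sum()


rng = np.random.default_rng(0)
q = np.zeros(3)
q = update_expected_return(q, task=1, reward=1.0, alpha=0.1)
greedy_task = epsilon_greedy(q, epsilon=0.0, rng=rng)  # no exploration
probs = boltzmann_probs(q, temperature=0.1)
```

To counter forgetting as discussed earlier, a caller would pass `np.abs(q)` instead of `q` to either policy, so that tasks with strongly negative expected return are also selected.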

3.2 Naive algorithm

To estimate the learning progress more reliably, one should practice the task several times. The Naive algorithm trains each task $K$ times, observes the resulting scores and estimates the slope of the learning curve using linear regression. The regression coefficient is used as the reward in the above non-stationary bandit algorithm. For details, see Algorithm 2 in Appendix A.
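The slope estimate can be illustrated with an ordinary least-squares fit. The sketch below uses `numpy.polyfit` with the sample index as the input variable; the function name and the example scores are hypothetical.

```python
import numpy as np


def learning_curve_slope(scores):
    """Least-squares slope of the scores against their index 0..K-1."""
    steps = np.arange(len(scores))
    slope, _intercept = np.polyfit(steps, scores, deg=1)
    return float(slope)


improving = learning_curve_slope([0.1, 0.2, 0.3, 0.4])   # positive slope
forgetting = learning_curve_slope([0.4, 0.3, 0.2, 0.1])  # negative slope
```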

3.3 Window algorithm

Repeating a task a fixed number of times is expensive when clearly no progress is being made. The Window algorithm keeps a FIFO buffer of the last $K$ scores, together with the timesteps when these scores were recorded. Linear regression is performed to estimate the slope of the learning curve for each task, with the timesteps as the input variables. The regression coefficient is used as the reward in the above non-stationary bandit algorithm. For details, see Algorithm 3 in Appendix A.
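A possible shape for the per-task window is sketched below, using `collections.deque` with `maxlen` as the FIFO buffer. The class name is hypothetical, and returning a slope of zero while fewer than two points are available is an assumption made here, not something stated in the text.

```python
from collections import deque

import numpy as np


class WindowSlope:
    """Keeps the last `window_size` (timestep, score) pairs per task and
    estimates the learning-curve slope by linear regression."""

    def __init__(self, num_tasks, window_size):
        self.scores = [deque(maxlen=window_size) for _ in range(num_tasks)]
        self.steps = [deque(maxlen=window_size) for _ in range(num_tasks)]

    def add(self, task, timestep, score):
        self.steps[task].append(timestep)
        self.scores[task].append(score)
        if len(self.scores[task]) < 2:
            return 0.0  # assumption: no slope estimate from a single point
        slope, _ = np.polyfit(list(self.steps[task]),
                              list(self.scores[task]), deg=1)
        return float(slope)


window = WindowSlope(num_tasks=2, window_size=3)
slopes = [
    window.add(0, timestep=1, score=0.0),
    window.add(0, timestep=3, score=0.2),
    window.add(0, timestep=5, score=0.4),
    window.add(0, timestep=7, score=0.4),  # oldest point drops out here
]
```

Because the timesteps are stored alongside the scores, the slope stays meaningful even when a task is visited at irregular intervals.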

3.4 Sampling algorithm

The previous algorithms require tuning of hyperparameters to balance exploration. To get rid of exploration hyperparameters, we take inspiration from Thompson sampling. The Sampling algorithm keeps a buffer of the last $K$ rewards for each task. To choose the next task, a recent reward is sampled from each task’s $K$-last-rewards buffer. Then whichever task yielded the highest sampled reward is chosen. This makes exploration a natural part of the algorithm: tasks that have recently had high rewards are sampled more often. For details, see Algorithm 4 in Appendix A.
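The selection rule can be sketched as follows. The class name is hypothetical; the sketch compares absolute values of the sampled rewards, following the absolute-value variant discussed earlier, and it gives tasks with an empty buffer an optimistic `default_reward` so that they get tried, which is an assumption of this sketch.

```python
import random
from collections import deque


class SamplingTeacher:
    """Thompson-sampling-inspired task choice from per-task reward buffers."""

    def __init__(self, num_tasks, buffer_size, seed=0, default_reward=1.0):
        self.buffers = [deque(maxlen=buffer_size) for _ in range(num_tasks)]
        self.rng = random.Random(seed)
        # Assumption: untried tasks get an optimistic reward.
        self.default_reward = default_reward

    def choose_task(self):
        sampled = [
            abs(self.rng.choice(list(buf))) if buf else self.default_reward
            for buf in self.buffers
        ]
        # Index of the largest sampled |reward| (first one wins ties).
        return max(range(len(sampled)), key=sampled.__getitem__)

    def record(self, task, reward):
        self.buffers[task].append(reward)


teacher = SamplingTeacher(num_tasks=3, buffer_size=10)
first_choice = teacher.choose_task()   # all buffers empty
teacher.record(0, 0.0)                 # no progress on task 0
teacher.record(1, -0.5)                # task 1 is being forgotten
teacher.record(2, 0.1)                 # slow progress on task 2
next_choice = teacher.choose_task()
```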

4 Experiments

4.1 Decimal Number Addition

Addition of decimal numbers with an LSTM is a well-known task that requires a curriculum to learn in reasonable time (Zaremba and Sutskever, 2014). It is implemented as a sequence-to-sequence model (Sutskever et al., 2014), where the input to the network is two decimal-coded numbers separated by a ’plus’ sign, and the output of the network is the sum of those numbers, also in decimal coding. The curriculum is based on the number of digits in the input numbers: it is easier to learn addition of short numbers first and then move on to longer numbers.
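To make the input and output format concrete, the sketch below generates one question/answer pair for a curriculum task defined by a maximum number of digits. The function name and the way lengths and operands are drawn are assumptions; the text does not specify the sampling scheme, and padding to a fixed size is omitted.

```python
import random


def make_addition_sample(max_digits, rng):
    """Return (question, answer) strings such as ('12+345', '357').

    Assumption: each operand's digit count is drawn uniformly
    from 1..max_digits, then the operand uniformly below 10**digits.
    """
    a = rng.randrange(10 ** rng.randint(1, max_digits))
    b = rng.randrange(10 ** rng.randint(1, max_digits))
    return f"{a}+{b}", str(a + b)


rng = random.Random(0)
samples = [make_addition_sample(max_digits=3, rng=rng) for _ in range(100)]
```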

Number addition is a supervised learning problem and therefore can be trained more efficiently by including several curriculum tasks in the mini-batch. Therefore we adopt the batch training scheme outlined in Section 2.2. The score we use is the accuracy of each task calculated on the validation set. The results shown below are means and standard deviations of 3 runs with different random seeds. Full experiment details can be found in Appendix C.

4.1.1 Addition with 1-dimensional Curriculum

We started with a setup similar to that of (Zaremba and Sutskever, 2014), where the curriculum task determines the maximum number of digits in both added numbers. The results are shown in Figure 3. Our algorithms outperformed uniform sampling and the best manual curriculum ("combined") for 9-digit addition from (Zaremba and Sutskever, 2014). An example of the task distribution during a training session is given in Figure 4.

Figure 3: Results for 9-digit 1D addition, lower is better. Variants using the absolute value of the expected reward surpass the best manual curriculum ("combined").
Figure 4: Progression of the task distribution over time for 9-digit 1D addition (Sampling). The algorithm progresses from simpler tasks to more complicated. Harder tasks take longer to learn and the algorithm keeps training on easier tasks to counter unlearning.

4.1.2 Addition with 2-dimensional Curriculum

We also experimented with a curriculum where the ordering of tasks is not obvious. We used the same decimal addition task, but in this case the length of each number is chosen separately, making the task space 2-dimensional. Each training batch is modelled as a probability distribution over the lengths of both numbers, $P(l_1, l_2)$. We also tried making this distribution independent such that $P(l_1, l_2) = P(l_1) P(l_2)$, but that did not work as well.
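The difference between a joint and a factorized task distribution over the two lengths can be illustrated with NumPy. The function names and the example probabilities below are hypothetical.

```python
import numpy as np


def sample_length_pairs(joint_probs, batch_size, rng):
    """joint_probs[i, j] is the probability of lengths (i + 1, j + 1)."""
    flat = rng.choice(joint_probs.size, size=batch_size,
                      p=joint_probs.ravel())
    len1, len2 = np.unravel_index(flat, joint_probs.shape)
    return len1 + 1, len2 + 1


def independent_joint(p_len1, p_len2):
    """Factorized alternative: the joint is the outer product of marginals."""
    return np.outer(p_len1, p_len2)


rng = np.random.default_rng(0)
joint = np.full((3, 3), 1.0 / 9.0)            # uniform over 3x3 length pairs
len1, len2 = sample_length_pairs(joint, batch_size=16, rng=rng)
factorized = independent_joint([0.5, 0.5], [0.25, 0.75])
```

A full joint table over 9 x 9 lengths has 81 entries, while the factorized form has only 18, so the factorized variant cannot put mass on one specific pair of lengths without also raising related pairs.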

There is no equivalent experiment in (Zaremba and Sutskever, 2014), so we created a manual curriculum inspired by their best 1D curriculum. In particular we increase difficulty by increasing the maximum length of both numbers, which reduces the problem to a 1D curriculum. Figure 5 shows the results for 9-digit 2D addition. Figure 6 illustrates the different approaches taken by the manual and automated curricula.

Figure 5: Results for 9-digit 2D addition, lower is better. The task seems easier, manual curriculum is hard to beat and uniform sampling is competitive.
Figure 6: Accuracy progress for 4-digit 2D addition. Top: TSCL. Bottom: the best manual curriculum. Our algorithm takes a distinctly different approach by training on shorter numbers first. 9-digit videos can be found at https://youtu.be/y_QIcQ6spWk and https://youtu.be/fB2kx-esjgw.

4.1.3 Observations

  • Using the absolute value of $Q$ boosts the performance of almost all the algorithms, which means it is effective in countering forgetting.

  • There is no universal best algorithm. For 1D the Window algorithm and for 2D the Naive algorithm performed the best. Sampling is competitive in both and has the fewest hyperparameters.

  • Whether $\epsilon$-greedy or Boltzmann exploration works better depends on the algorithm.

  • Uniform sampling is surprisingly efficient, especially in the 2D case.

  • The 2D task is solved faster and the manual curriculum is hard to beat in 2D.

4.2 Minecraft

Minecraft is a popular 3D video game where players can explore, craft tools and build arbitrary structures, making it a potentially rich environment for AI research. We used the Malmo platform (Johnson et al., 2016) with an OpenAI Gym wrapper (https://github.com/tambetm/gym-minecraft) to interact with Minecraft in our reinforcement learning experiments. In particular we used ClassroomDecorator from Malmo to generate random mazes for the agent to solve. The mazes contain sequences of rooms separated by the following obstacles:

  • Wall – the agent has to locate a doorway in the wall.

  • Lava – the agent has to cross a bridge over lava.

We only implemented the Window algorithm for the Minecraft task, because the other algorithms rely on score change, which is not straightforward to calculate for a parallel training scheme. As baselines we use uniform sampling, training only on the last task, and a manually tuned curriculum. Full experimental details can be found in Appendix D.

Figure 7: 5-step curriculum.

4.2.1 5-step Curriculum

We created a simple curriculum with 5 steps:

  1. A single room with a target.

  2. Two rooms separated by lava.

  3. Two rooms separated by wall.

  4. Three rooms separated by lava and wall, in random order.

  5. Four rooms separated by lava and walls, in random order.

Refer to Figure 7 for the room layout. The starting position of the agent and the location of the target were randomized for each episode. Manual curriculum trained first task for steps, second, third and fourth task for steps, and fifth task for steps.

Figure 8 shows learning curves for Minecraft 5-step curriculum. The mean curve and standard deviation are based on 3 runs with different random seeds.

Figure 8: Minecraft 5-step curriculum results, Y-axis shows mean episode reward per timesteps for the current task. Left: training performance, notice the manual curriculum task switches after , , and steps. For automatic curriculum the training score has no clear interpretation. Right: evaluation training on the last task. When training only on the last task the agent did not make any progress at all. When training on a uniform mix of the tasks the progress was slow. Manual curriculum allowed the agent to learn the last task to an acceptable level. TSCL is comparable to the manual curriculum in performance.

Video of the trained agent can be found here: https://youtu.be/cada0d_aDIc. The learned policy is robust to the number of rooms, given that obstacles are of the same type. The code is available at https://github.com/tambetm/TSCL.

5 Related Work

Work by (Bengio et al., 2009) sparked general interest in curriculum learning. More recent results include learning to execute short programs (Zaremba and Sutskever, 2014), finding shortest paths in graphs (Graves et al., 2016) and learning to play first-person shooter (Wu and Tian, 2017). All those works rely on manually designed curricula and do not attempt to produce it automatically.

The idea of using learning progress as the reward could be traced back to (Schmidhuber, 1991). It has been successfully applied in the context of developmental robotics to learn object manipulation (Oudeyer et al., 2007; Baranes and Oudeyer, 2013) and also in actual classroom settings to teach primary school students (Clement et al., 2015). Using learning progress as the reward can be linked to the concept of intrinsic motivation (Oudeyer and Kaplan, 2007; Schmidhuber, 2010).

Several algorithms for adversarial bandits were analyzed in (Auer et al., 2002). While many of those algorithms have formal worst-case guarantees, in our experiments they did not perform well. The problem is that they come with no assumptions. In curriculum learning we can assume that rewards change smoothly over time.

More recently (Sukhbaatar et al., 2017) proposed a method to generate incremental goals and therefore curricula automatically. The setup consists of two agents, Alice and Bob, where Alice is generating trajectories and Bob is trying to either repeat or reverse them. Similar work by (Held et al., 2017) uses generative adversarial network to generate goal states for an agent. Compared to TSCL, they are able to generate new subtasks on the go, but this mainly aids in exploration and is not guaranteed to help in learning the final task. (Sharma and Ravindran, 2017) apply similar setup as ours to multi-task learning. In their work they practice more tasks that are underperforming compared to preset baseline, as opposed to our approach of using learning progress. (Jain and Tulabandhula, 2017) estimate transfer between subtasks and target task, and create curriculum based on that.

The most similar work to ours was done concurrently in (Graves et al., 2017). While the problem statement is strikingly similar, our approaches differ. They apply automatic curriculum learning only to supervised sequence learning tasks, while we also consider reinforcement learning tasks. They use the EXP3.S algorithm for adversarial bandits, while we propose alternative algorithms inspired by non-stationary bandits. They consider other learning progress metrics based on complexity gain, while we focus only on prediction gain (which performed best overall in their experiments). Moreover, their work only uses uniform sampling of tasks as a baseline, whereas ours compares against the best known manual curriculum for the given tasks. In summary, they arrive at very similar conclusions to ours.

Decimal addition has also been explored in (Kalchbrenner et al., 2015; Reed and De Freitas, 2015; Kaiser and Sutskever, 2015), sometimes improving results over original work in (Zaremba and Sutskever, 2014). Our goal was not to improve the addition results, but to evaluate different curriculum approaches, therefore there is no direct comparison.

Minecraft is a relatively recent addition to reinforcement learning environments. Work by (Oh et al., 2016) evaluates memory-based architectures for Minecraft. They use cognition-inspired tasks in visual grid-world. Our tasks differ in that they do not need explicit memory, and the movement is continuous, not grid-world. Another work by (Tessler et al., 2016) uses tasks similar to ours but they take different approach: they learn a Deep Skill Module for each subtask, freeze weights of those modules and train hierarchical deep reinforcement learning network to pick either single actions or subtask policies. In contrast our approach uses simple policy network and relies on the TSCL to learn (and not forget) the subtasks.

While exploration bonuses (Bellemare et al., 2016; Houthooft et al., 2016; Stadie et al., 2015) solve the same problem of sparse rewards, they apply to Student algorithms, while we were considering different Teacher approaches. For this reason we leave the comparison with exploration bonuses to future work.

6 Conclusion

We presented a framework for automatic curriculum learning that can be used for supervised and reinforcement learning tasks. We proposed a family of algorithms within that framework based on the concept of learning progress. While many of the algorithms performed equally well, it was crucial to rely on the absolute value of the slope of the learning curve when choosing the tasks. This guarantees the re-training on tasks which the network is starting to forget. In our LSTM decimal addition experiments, the Sampling algorithm outperformed the best manually designed curriculum as well as the uniform sampling. On the challenging 5-task Minecraft navigation problem, our Window algorithm matched the performance of a carefully designed manual curriculum, and significantly outperformed uniform sampling. For problems where curriculum learning is necessary, TSCL can avoid the tedium of ordering the difficulty of subtasks and hand-designing the curriculum.

7 Future Work

In this work we only considered discrete task parameterizations. In the future it would be interesting to apply the idea to continuous task parameterizations. Another promising idea to explore is the usage of automatic curriculum learning in contexts where the subtasks have not been pre-defined. For example, subtasks can be sampled from a generative model, or taken from different initial states in the same environment.

8 Acknowledgements

We thank Microsoft for their excellent Malmö environment for Minecraft, Josh Tobin and Pieter Abbeel for suggestions and comments, Vicky Cheung, Jonas Schneider, Ben Mann and Art Chaidarun for always being helpful with OpenAI infrastructure. Also Raul Vicente, Ardi Tampuu and Ilya Kuzovkin from University of Tartu for comments and discussion.

References

Appendix A Simple versions of the algorithms

Algorithm 1 Online algorithm
Initialize Student learning algorithm
Initialize expected return $Q(a) = 0$ for all $N$ tasks
for t=1,…,T do
     Choose task $a_t$ based on $|Q|$ using $\epsilon$-greedy or Boltzmann policy
     Train Student using task $a_t$ and observe reward $r_t = x_t^{(a_t)} - x_{t'}^{(a_t)}$
     Update expected return $Q(a_t) = \alpha r_t + (1 - \alpha) Q(a_t)$
end for

Algorithm 2 Naive algorithm
Initialize Student learning algorithm
Initialize expected return $Q(a) = 0$ for all $N$ tasks
for t=1,…,T do
     Choose task $a_t$ based on $|Q|$ using $\epsilon$-greedy or Boltzmann policy
     Reset $D = \emptyset$
     for k=1,…,K do
         Train Student using task $a_t$ and observe score $o_t = x_t^{(a_t)}$
         Store score $o_t$ in list $D$
     end for
     Apply linear regression to $D$ and extract the coefficient as $r_t$
     Update expected return $Q(a_t) = \alpha r_t + (1 - \alpha) Q(a_t)$
end for

Algorithm 3 Window algorithm
Initialize Student learning algorithm
Initialize FIFO buffers $D(a)$ and $E(a)$ with length $K$ for all $N$ tasks
Initialize expected return $Q(a) = 0$ for all $N$ tasks
for t=1,…,T do
     Choose task $a_t$ based on $|Q|$ using $\epsilon$-greedy or Boltzmann policy
     Train Student using task $a_t$ and observe score $o_t = x_t^{(a_t)}$
     Store score $o_t$ in $D(a_t)$ and timestep $t$ in $E(a_t)$
     Use linear regression to predict $D(a_t)$ from $E(a_t)$ and use the coef. as $r_t$
     Update expected return $Q(a_t) = \alpha r_t + (1 - \alpha) Q(a_t)$
end for

Algorithm 4 Sampling algorithm
Initialize Student learning algorithm
Initialize FIFO buffers $D(a)$ with length $K$ for all $N$ tasks
for t=1,…,T do
     Sample reward $\tilde{r}_a$ from $D(a)$ for each task (if $|D(a)| = 0$ then $\tilde{r}_a = 1$)
     Choose task $a_t = \arg\max_a |\tilde{r}_a|$
     Train Student using task $a_t$ and observe reward $r_t = x_t^{(a_t)} - x_{t'}^{(a_t)}$
     Store reward $r_t$ in $D(a_t)$
end for

Appendix B Batch versions of the algorithms

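Algorithm 5 Online algorithm
Initialize Student learning algorithm
Initialize expected return $Q(a) = 0$ for all $N$ tasks
for t=1,…,T do
     Create prob. dist. $a_t = (p_t^{(1)}, \dots, p_t^{(N)})$ based on $|Q|$ using $\epsilon$-greedy or Boltzmann policy
     Train Student using prob. dist. $a_t$ and observe scores $o_t = (x_t^{(1)}, \dots, x_t^{(N)})$
     Calculate score changes $r_t = o_t - o_{t-1}$
     Update expected return $Q = \alpha r_t + (1 - \alpha) Q$
end for

Algorithm 6 Naive algorithm
Initialize Student learning algorithm
Initialize expected return $Q(a) = 0$ for all $N$ tasks
for t=1,…,T do
     Create prob. dist. $a_t = (p_t^{(1)}, \dots, p_t^{(N)})$ based on $|Q|$ using $\epsilon$-greedy or Boltzmann policy
     Reset $D(a) = \emptyset$ for all tasks
     for k=1,…,K do
         Train Student using prob. dist. $a_t$ and observe scores $o_t = (x_t^{(1)}, \dots, x_t^{(N)})$
         Store score $o_t^{(a)}$ in list $D(a)$ for each task $a$
     end for
     Apply linear regression to each $D(a)$ and extract the coefficients as vector $r_t$
     Update expected return $Q = \alpha r_t + (1 - \alpha) Q$
end for

Algorithm 7 Window algorithm
Initialize Student learning algorithm
Initialize FIFO buffers $D(a)$ with length $K$ for all $N$ tasks
Initialize expected return $Q(a) = 0$ for all $N$ tasks
for t=1,…,T do
     Create prob. dist. $a_t = (p_t^{(1)}, \dots, p_t^{(N)})$ based on $|Q|$ using $\epsilon$-greedy or Boltzmann policy
     Train Student using prob. dist. $a_t$ and observe scores $o_t = (x_t^{(1)}, \dots, x_t^{(N)})$
     Store score $o_t^{(a)}$ in $D(a)$ for all tasks $a$
     Apply linear regression to each $D(a)$ and extract the coefficients as vector $r_t$
     Update expected return $Q = \alpha r_t + (1 - \alpha) Q$
end for

Algorithm 8 Sampling algorithm
Initialize Student learning algorithm
Initialize FIFO buffers $D(a)$ with length $K$ for all $N$ tasks
for t=1,…,T do
     Sample reward $\tilde{r}_a$ from $D(a)$ for each task (if $|D(a)| = 0$ then $\tilde{r}_a = 1$)
     Create one-hot prob. dist. $\tilde{a}_t$ based on $\arg\max_a |\tilde{r}_a|$
     Mix in uniform dist.: $a_t = (1 - \epsilon) \tilde{a}_t + \epsilon / N$
     Train Student using prob. dist. $a_t$ and observe scores $o_t = (x_t^{(1)}, \dots, x_t^{(N)})$
     Calculate score changes $r_t = o_t - o_{t-1}$
     Store reward $r_t^{(a)}$ in $D(a)$ for each task $a$
end for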
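The step that mixes a uniform distribution into the one-hot task choice of the batch Sampling algorithm could be written as below. The exact mixing formula is an assumption of this sketch (a convex combination with weight `epsilon` on the uniform part), as are the function and parameter names.

```python
import numpy as np


def mixed_one_hot(chosen_task, num_tasks, epsilon):
    """One-hot distribution on chosen_task, blended with a uniform one.

    Assumption: dist = (1 - epsilon) * one_hot + epsilon / num_tasks.
    """
    one_hot = np.zeros(num_tasks)
    one_hot[chosen_task] = 1.0
    return (1.0 - epsilon) * one_hot + epsilon / num_tasks


dist = mixed_one_hot(chosen_task=2, num_tasks=4, epsilon=0.1)
```

The uniform part keeps every task present in each batch, so a validation score change can be attributed to all tasks at every step.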

Appendix C Decimal Number Addition Training Details

Our reimplementation of decimal addition is based on Keras [Chollet et al., 2015]. The encoder and decoder are both LSTMs with 128 units. In contrast to the original implementation, the hidden state is not passed from encoder to decoder; instead the last output of the encoder is provided to all inputs of the decoder. One curriculum training step consists of training on 40,960 samples. The validation set consists of 4,096 samples and 4,096 is also the batch size. The Adam optimizer [Kingma and Ba, 2014] is used for training with the default learning rate of 0.001. Both input and output are padded to a fixed size.

In the experiments we used the number of steps until 99% validation set accuracy is reached as a comparison metric. The exploration coefficient was fixed to 0.1, the temperature was fixed to 0.0004, the learning rate was 0.1, and the window size was 10 in all experiments.

Appendix D Minecraft Training Details

The Minecraft task consisted of navigating through randomly generated mazes. The maze ends with a target block and the agent gets 1,000 points by touching it. Each move costs -0.1 and dying in lava or getting a timeout yields -1,000 points. Timeout is 30 seconds (1,500 steps) in the first task and 45 seconds (2,250 steps) in the subsequent tasks.
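The reward scheme above can be summarized in a small helper. The function name and the outcome labels are hypothetical, and charging the move cost for every move including the final one is an assumption.

```python
def episode_return(num_moves, outcome):
    """Total episode reward: -0.1 per move, +1000 for reaching the target,
    -1000 for dying in lava or timing out."""
    terminal = {"target": 1000.0, "lava": -1000.0, "timeout": -1000.0}[outcome]
    return terminal - 0.1 * num_moves


quick_success = episode_return(100, "target")      # about 990
first_task_timeout = episode_return(1500, "timeout")  # about -1150
```

With these magnitudes the terminal reward dominates the per-move cost, so the signal is sparse: until the agent first reaches the target, returns differ only by how and when the episode ends.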

For learning we used the proximal policy optimization (PPO) algorithm [Schulman et al., 2017] implemented using Keras [Chollet et al., 2015] and optimized for real-time environments. The policy network used four convolutional layers and one LSTM layer. The input to the network was a color image and the outputs were two Gaussian actions: move forward/backward and turn left/right. In addition, the policy network had a state value output, which was used as the baseline. Figure 9 shows the network architecture.

Figure 9: Network architecture used for Minecraft.

For training we used a setup with 10 parallel Minecraft instances. The agent code was separated into runners, that interact with the environment, and a trainer, that performs batch training on GPU, similar to Babaeizadeh et al. [2016]. Runners regularly update their snapshot of the current policy weights, but they only perform prediction (forward pass), never training. After a fixed number of steps they use FIFO buffers to send collected states, actions and rewards to the trainer. Trainer collects those experiences from all runners, assembles them into batches and performs training. FIFO buffers shield the runners and the trainer from occasional hiccups. This also means that the trainer is not completely on-policy, but this problem is handled by the importance sampling in PPO.

Figure 10: Training scheme used for Minecraft.

During training we also used frame skipping, i.e. processed only every 5th frame. This sped up the learning considerably and the resulting policy also worked without frame skip. Also, we used auxiliary loss for predicting the depth as suggested in [Mirowski et al., 2016]. Surprisingly this resulted only in minor improvements.

For automatic curriculum learning we only implemented the Window algorithm for the Minecraft task, because the other algorithms rely on score change, which is not straightforward to calculate for a parallel training scheme. The window size was defined in timesteps and fixed to 10,000 in the experiments, and the exploration rate was set to 0.1.

The idea of the first task in the curriculum was to make the agent associate the target with a reward. In practice this task proved to be too simple - the agent could achieve almost the same reward by doing backwards circles in the room. For this reason we added penalty for moving backwards to the policy loss function. This fixed the problem in most cases, but we occasionally still had to discard some unsuccessful runs. Results only reflect the successful runs.

We also had some preliminary success combining continuous (Gaussian) actions with binary (Bernoulli) actions for "jump" and "use" controls, as shown in Figure 9. This allowed the agent to also learn to cope with rooms that involve doors, switches or jumping obstacles; see https://youtu.be/e1oKiPlAv74.