
Learning Montezuma's Revenge from a Single Demonstration

by Tim Salimans, et al.

We propose a new method for learning from a single demonstration to solve hard exploration tasks like the Atari game Montezuma's Revenge. Instead of imitating human demonstrations, as proposed in other recent works, our approach is to maximize rewards directly. Our agent is trained using off-the-shelf reinforcement learning, but starts every episode by resetting to a state from a demonstration. By starting from such demonstration states, the agent requires much less exploration to learn a game compared to when it starts from the beginning of the game at every episode. We analyze reinforcement learning for tasks with sparse rewards in a simple toy environment, where we show that the run-time of standard RL methods scales exponentially in the number of states between rewards. Our method reduces this to quadratic scaling, opening up many tasks that were previously infeasible. We then apply our method to Montezuma's Revenge, for which we present a trained agent achieving a high-score of 74,500, better than any previously published result.



1 Introduction

Model-free reinforcement learning is learning by trial and error. Methods such as policy gradients (Williams (1992); Sutton et al. (2000); Kakade (2002)) and Q-learning (Watkins and Dayan (1992); Mnih et al. (2015)) explore an environment by taking random actions. If, by chance, the random actions lead to success, formulated as achieving a reward, they are reinforced and the agent becomes more likely to take these beneficial actions again in the future. This works well if rewards are frequent enough for random actions to lead to a reward with reasonable probability. Unfortunately, many of the tasks we would like to solve don’t have such dense rewards, instead requiring long sequences of very specific actions to achieve any success. Such sequences are extremely unlikely to occur randomly.

Consider a task where it takes a precise sequence of $N$ actions to achieve the first reward. If each of those actions is taken with a fixed probability, a random agent will need to explore this environment for a duration that scales as $\exp(N)$ before it can expect to experience the first reward. A well-known example of such a task is the Atari game Montezuma’s Revenge, where the goal is to navigate a series of chambers to collect keys, diamonds, and other items, while evading various opponents and traps. In this game, the probability of achieving the first reward of the game, the key in the first room, can be decomposed as

$$p(\text{get key}) = p(\text{get down ladder 1}) \cdot p(\text{get down rope}) \cdot p(\text{get down ladder 2}) \cdot p(\text{jump over skull}) \cdot p(\text{get up ladder 3}).$$

By multiplying $N$ of these probabilities together, we end up with a resulting probability that is exponentially smaller than any of the individual input probabilities. As a result, taking uniformly random actions in Montezuma’s Revenge only produces a reward about once in every half a million steps. This exponential scaling of reinforcement learning severely limits the tasks current RL techniques can solve.
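To make this scaling concrete, the following minimal sketch computes the chance that a uniformly random agent reproduces one exact sequence of actions, and the expected number of independent attempts before the first success. The function names, the action count of 18, and the sequence lengths are illustrative assumptions, not quantities measured in our experiments.

```python
def success_probability(num_actions: int, sequence_length: int) -> float:
    """Probability that uniformly random actions reproduce one exact sequence."""
    return (1.0 / num_actions) ** sequence_length


def expected_attempts(num_actions: int, sequence_length: int) -> float:
    """Mean number of independent attempts before the first success.

    The number of attempts is geometrically distributed, so its mean is 1 / p.
    """
    return 1.0 / success_probability(num_actions, sequence_length)


if __name__ == "__main__":
    # Hypothetical setting: 18 discrete actions, sequences of growing length.
    for n in (1, 2, 4, 8):
        print(n, expected_attempts(num_actions=18, sequence_length=n))
```

Each additional required action multiplies the expected number of attempts by the number of available actions, which is the exponential growth described above.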

Prior works (Dearden et al. (1999); Kolter and Ng (2009); Tang et al. (2017); Ostrovski et al. (2017); Chen et al. (2017); Pathak et al. (2017); O’Donoghue et al. (2016); Schulman et al. (2017a); Haarnoja et al. (2017); Nachum et al. (2017)) have proposed various methods of overcoming the exploration problems of naive model-free RL. However, these methods have so far produced only limited gains for solving hard exploration tasks such as Montezuma’s Revenge. In this work, we instead consider using a single demonstration of successful completion of the task to aid exploration in reinforcement learning. Most previous work on learning from demonstrations has so far focused on imitation learning, where the agent is trained to imitate the demonstration. A downside of this approach is that many different demonstrations are required for the agent to learn to generalize, and that these demonstrations need to be of high quality to avoid learning a sub-optimal solution. Here, we instead show that it is feasible to learn to solve sparse reward problems like Montezuma’s Revenge purely by RL, by bypassing the exploration problem through starting each episode from a carefully selected state from the demonstration.

2 Method

Although model-free RL methods have difficulty finding long sequences of actions, they work well for shorter sequences. The main insight behind our proposed method is that we can make a task easier to solve by decomposing it into a curriculum of subtasks requiring short action sequences. We construct this curriculum by starting each RL episode from a demonstration state. We implement this idea in a distributed setting, using parallel rollout workers that collect data by acting in an environment according to a shared RNN policy $\pi_\theta$, after which the data is fed to a centralized optimizer to learn an improvement of the policy.

Given a previously recorded demonstration $\{(\tilde{s}_t, \tilde{a}_t, \tilde{r}_t, \tilde{s}_{t+1}, \tilde{d}_t)\}_{t=0}^{T}$, our approach works by letting each parallel rollout worker (Algorithm 1) start its episode from a state in the demonstration. Early on in training, all workers start from states near the end of the demonstration at time $T$. These reset points are then gradually moved back in time as training proceeds. The data produced by the rollout workers is fed to a central optimizer (Algorithm 2) that updates the policy using an off-the-shelf RL method such as PPO (Schulman et al. (2017b)), A3C (Mnih et al. (2016)), or Impala (Espeholt et al. (2018)). In addition, the central optimizer calculates the proportion of rollouts that beat or at least tie the score of the demonstrator on the corresponding part of the game. If the proportion is higher than a threshold $\rho$, we move the reset point $\tau^*$ backward in the demonstration.

Within each iteration of Algorithm 1, the rollout workers obtain the latest policy $\pi_\theta$ and the central reset point $\tau^*$ from the optimizer. Each worker then samples a local starting point $\tau$ from a small set of time steps $\{\tau^* - D, \ldots, \tau^*\}$ to increase diversity. In each episode, we first initialize the hidden states of the agent’s RNN policy by taking actions based on the demonstration segment $\{(\tilde{s}_i, \tilde{a}_i, \tilde{r}_i, \tilde{s}_{i+1}, \tilde{d}_i)\}_{i=\tau-K}^{\tau-1}$ directly preceding the local starting point $\tau$, after which the agent takes actions based on the current policy $\pi_\theta$. The demonstration segment used for initializing the RNN states is masked out in the data used for training, such that it does not contribute to the gradient used in the policy update. At the end of each episode, we increment a success counter $W$ if the current episode achieved a score at least as high as the demonstration’s score on the same segment. This counter is then used to decrease the central starting point $\tau^*$ at the right speed.

Training proceeds until the central reset point has reached the beginning of the game, i.e. $\tau^* = 0$, so that the agent is succeeding at the game without using the demonstration at all. At this point we have an RL-trained agent beating or tying the human expert demonstration on the entire game.
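The bookkeeping a rollout worker performs at the start of each episode can be sketched as follows. This is a minimal illustration under stated assumptions, not our distributed implementation: the helper names `sample_start` and `episode_plan` are hypothetical, and the environment, the policy, and the demonstration itself are left out. The sketch only shows how a local starting point is drawn from the window before the central reset point, and which time steps are replayed from the demonstration (and therefore masked out of training).

```python
import random
from typing import List, Tuple


def sample_start(reset_point: int, num_starts: int, rng: random.Random) -> int:
    """Pick a local start uniformly from the window ending at the reset point."""
    low = max(0, reset_point - num_starts)
    return rng.randint(low, reset_point)  # randint includes both endpoints


def episode_plan(start: int, memory_len: int, rollout_len: int) -> List[Tuple[int, bool]]:
    """Return (time step, trainable) pairs for one rollout segment.

    Steps before `start` replay the demonstration to warm up the recurrent
    state and are marked untrainable; steps from `start` on use the policy.
    """
    first = max(0, start - memory_len)
    return [(t, t >= start) for t in range(first, first + rollout_len)]


if __name__ == "__main__":
    rng = random.Random(0)
    tau = sample_start(reset_point=100, num_starts=4, rng=rng)
    for step, trainable in episode_plan(tau, memory_len=3, rollout_len=6):
        print(step, "policy" if trainable else "demo replay (masked)")
```

The boolean in each pair plays the role of the training mask: transitions copied from the demonstration carry `False` and should not contribute to the policy gradient.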

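Input: a human demonstration $\{(\tilde{s}_t, \tilde{a}_t, \tilde{r}_t, \tilde{s}_{t+1}, \tilde{d}_t)\}_{t=0}^{T}$, number of starting points $D$, effective RNN memory length $K$, batch rollout length $L$.
Initialize starting point $\tau$ by sampling uniformly from $\{T - D, \ldots, T\}$
Initialize environment to demonstration state $\tilde{s}_\tau$
Initialize time counter $i = \tau - K$
while TRUE do
     Get latest policy $\pi_\theta$ from optimizer
     Get latest reset point $\tau^*$ from optimizer
     Initialize success counter $W = 0$
     Initialize batch $\mathcal{D} = \{\}$
     for step in $0, \ldots, L - 1$ do
         if $i \geq \tau$ then
              Sample action $a_i \sim \pi_\theta(s_i)$
              Take action $a_i$ in the environment
              Receive reward $r_i$, next state $s_{i+1}$ and done signal $d_i$
              $m_i = \text{TRUE}$   ▷ We can train on this data
         else   ▷ Replay demonstration to initialize RNN state of policy
              Copy data from demonstration: $a_i = \tilde{a}_i$, $r_i = \tilde{r}_i$, $s_{i+1} = \tilde{s}_{i+1}$, $d_i = \tilde{d}_i$
              $m_i = \text{FALSE}$   ▷ We should mask out this transition in training
         end if
         Add data $\{s_i, a_i, r_i, s_{i+1}, d_i, m_i\}$ to batch $\mathcal{D}$
         Increment time counter $i \leftarrow i + 1$
         if $d_{i-1} = \text{TRUE}$ then
              if $\sum_{t \geq \tau} r_t \geq \sum_{t \geq \tau} \tilde{r}_t$ then   ▷ As good as demo
                   $W \leftarrow W + 1$
              end if
              Sample next starting point $\tau$ uniformly from $\{\tau^* - D, \ldots, \tau^*\}$
              Set time counter $i \leftarrow \tau - K$
              Reset environment to state $\tilde{s}_\tau$
         end if
     end for
     Send batch $\mathcal{D}$ and counter $W$ to optimizer
end while
Algorithm 1 Demonstration-Initialized Rollout Worker

Input: number of parallel agents $M$, starting point shift size $\Delta$, success threshold $\rho$, initial parameters $\theta_0$, demonstration length $T$, learning algorithm $\mathcal{A}$ (e.g. PPO, A3C, Impala, etc.)
Set the reset point $\tau^* = T$ to the end of the demonstration
Start rollout workers $j = 0, \ldots, M - 1$
while $\tau^* > 0$ do
     Gather data $\mathcal{D} = \{\mathcal{D}_0, \ldots, \mathcal{D}_{M-1}\}$ and success counters $W_0, \ldots, W_{M-1}$ from rollout workers
     if the proportion of successful rollouts, computed from the counters $W_j$, is at least $\rho$ then   ▷ The workers are successful sufficiently often
          $\tau^* \leftarrow \tau^* - \Delta$
     end if
     $\theta \leftarrow \mathcal{A}(\theta, \mathcal{D})$   ▷ Make sure to mask out demo transitions
     Broadcast $\theta$, $\tau^*$ to rollout workers
end while
Algorithm 2 Optimizer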

By slowly moving the starting state from the end of the demonstration to the beginning, we ensure that at every point the agent faces an easy exploration problem where it is likely to succeed, since it has already learned to solve most of the remaining game. We can interpret solving the RL problem in this way as a form of dynamic programming (Bagnell et al., 2004). If a specific sequence of $N$ actions is required to reach a reward, this sequence may now be learned in a time that is quadratic in $N$, rather than exponential. Figure 1 illustrates this intuition in Montezuma’s Revenge.
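The rule the optimizer uses to move the reset point can be written in a few lines. The sketch below is illustrative only: the function name `update_reset_point` and the example values for the threshold, shift size, and demonstration length are assumptions chosen for readability, not the settings used in our experiments.

```python
def update_reset_point(reset_point: int, successes: int, episodes: int,
                       threshold: float, shift: int) -> int:
    """Move the reset point earlier when enough episodes matched the demo.

    `successes` counts episodes whose return was at least the demonstration's
    return over the same segment; `episodes` counts all completed episodes.
    """
    if episodes > 0 and successes / episodes >= threshold:
        return max(0, reset_point - shift)  # never move before the game start
    return reset_point


if __name__ == "__main__":
    tau = 100  # hypothetical demonstration length
    for successes in (1, 3, 5):
        tau = update_reset_point(tau, successes, episodes=10, threshold=0.2, shift=5)
        print(tau)
```

Because the reset point only moves when the success rate clears the threshold, the curriculum advances quickly through easy stretches of the game and stalls on hard ones until the policy has caught up.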

Figure 1: Impression of our agent learning to reach the first key in Montezuma’s Revenge using RL and starting each episode from a demonstration state. When our agent starts playing the game, we place it right in front of the key, requiring it to only take a single jump to find success. After our agent has learned to do this consistently, we slowly move the starting point back in time. Our agent might then find itself halfway up the ladder that leads to the key. Once it learns to climb the ladder from there, we can have it start at the point where it needs to jump over the skull. After it learns to do that, we can have it start on the rope leading to the floor of the room, etc. Eventually, the agent starts in the original starting state of the game and is able to reach the key completely by itself.

3 Related work

3.1 State resetting

Starting episodes by resetting from demonstration states was previously proposed by Hosu and Rebedea (2016) for learning difficult Atari games. However, their method did not construct a curriculum that gradually moves the starting state back from the end of the demonstration to the beginning. We found such a curriculum to be vitally important for deriving any benefit from the demonstration.

The idea of constructing a learning curriculum by starting each RL episode from a sequence of increasingly more difficult starting points was used recently by Florensa et al. (2017) for an application in robotics. Rather than selecting states from a demonstration, the authors construct the curriculum by iteratively perturbing a set of starting states using random actions and then selecting the resulting states with the right level of difficulty.

After the release of an early version of this work, Resnick et al. (2018) published concurrent work exploring the construction of a curriculum from demonstration states, showing promising results in both single-agent and multi-agent tasks. Their curriculum also starts with states near the end of the game and moves gradually to the beginning. In contrast to our dynamically learned curriculum, theirs is predefined for each task. We also explored this in our experiments but found the dynamic adjustment of the starting state to be crucial to achieve good results on difficult tasks like Montezuma’s Revenge.

3.2 Imitation Learning

Another relevant research direction aims to solve the exploration problem via imitation of a human expert. One example is the work by Peng et al. (2018) which uses imitation learning to mimic demonstrated movements for use in physics-based animation. Another example is the work by Nair et al. (2018) that combines demonstration-based imitation learning and reinforcement learning to overcome exploration problems for robotic tasks.

Recently, several researchers successfully demonstrated an agent learning Montezuma’s Revenge by imitation learning from a demonstration. Aytar et al. (2018) train an agent to achieve the same states seen in a YouTube video of Montezuma’s Revenge, whereas Pohlen et al. (2018) combine a sophisticated version of Q-learning with maximizing the likelihood of actions taken in a demonstration. Garmulewicz et al. (2018) proposed the expert-augmented actor critic method, which combines the ACKTR policy gradient optimizer (Wu et al. (2017)) with an extra loss term based on supervision from expert demonstrations, also obtaining strong results on Montezuma’s Revenge.

The advantage of approaches based on imitation is that they do not generally require as much control over the environment as our technique does: they do not reset the environment to states other than the starting state of the game, and they do not presume access to the full game states encountered in the demonstration. Our method differs by directly optimizing what we care about, the game score, rather than making the agent imitate the demonstration. Our method can thus learn from a single demonstration without overfitting, is more robust to potentially sub-optimal demonstrations, and could offer benefits in multi-agent settings where we want to optimize performance against opponents other than those seen in the demonstration.

4 Experiments

We test our method on two environments. The first is the blind cliff walk environment of Schaul et al. (2015), where we demonstrate that our method reduces the exponential exploration complexity of conventional RL methods to quadratic complexity. The second is the notoriously hard Atari game Montezuma’s Revenge, where we achieve a higher score than any previously published result. We discuss results, implementation details, and remaining challenges.

4.1 Blind cliff walk

To gain insight into our proposed algorithm we start with the blind cliff walk environment proposed by Schaul et al. (2015). This is a simple RL toy problem where the goal is for the agent to blindly navigate a one dimensional cliff. The agent starts in state 0 and has 2 available actions. One of these actions will take it to the next state, while the other action will make it fall off the cliff, at which point it needs to start over. Only when the end of the cliff is reached (the last of $N$ states) does the agent receive a reward. We assume there is no way for the agent to generalize across the different states, so that the agent has to learn a tabular policy.

Figure 2: Visualization of the blind cliff walk problem. A single correct sequence of actions traverses all states, leading to a reward. The agent learns a tabular policy for this problem and does not generalize between states. We can vary the problem size $N$ to make the problem more or less difficult.

As explained by Schaul et al. (2015), naive application of RL to this problem suffers a run time that scales exponentially in the problem size $N$. The reason is that it takes the agent on the order of $2^N$ random steps to achieve a reward so that it can learn. Fortunately we can do better when we’re given a demonstration of an agent successfully completing the cliff walk. We start by having the agent start each episode at state $N-2$, the second to last state from the demonstration. This means it only needs to take a single right action to receive a reward, so learning is nearly instant. After learning has been successful from state $N-2$, we can start our episodes at state $N-3$, etc. This is expected to give a total run time that scales quadratically in the problem size $N$, as the agent needs to take on the order of $N$ steps to learn what to do in each of the $N$ states. This is a huge improvement over the exponential scaling of the naive RL implementation. Figure 3 shows that this advantage in scaling indeed holds in practice for this simple example.
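The difference in scaling can be reproduced with a small simulation. The sketch below is a simplified stand-in, not the experiment behind our reported numbers: instead of a policy gradients optimizer it uses a hypothetical memorize-on-success learner, which acts uniformly at random in a state until an episode passing through that state ends in reward and repeats the rewarded action afterwards. For simplicity every one of the `n_states` states requires one correct action. The names `steps_to_solve` and `mean_steps` are our own for this illustration.

```python
import random
from statistics import geometric_mean


def steps_to_solve(n_states: int, use_curriculum: bool, rng: random.Random) -> int:
    """Count environment steps until the walk is solved starting from state 0.

    correct[s] is the action (0 or 1) that advances from state s. The learner
    is a simplified stand-in for policy gradients: it acts at random in a
    state until an episode through that state is rewarded.
    """
    correct = [rng.randrange(2) for _ in range(n_states)]
    learned = [False] * n_states
    start = n_states - 1 if use_curriculum else 0
    steps = 0
    while True:
        state, visited = start, []
        while state < n_states:
            action = correct[state] if learned[state] else rng.randrange(2)
            steps += 1
            if action != correct[state]:
                break  # fell off the cliff: episode ends without reward
            visited.append(state)
            state += 1
        else:
            # Reached the end of the cliff: reinforce this episode's actions.
            for s in visited:
                learned[s] = True
            if start == 0:
                return steps
            start -= 1  # move the starting state one step back


def mean_steps(n_states: int, use_curriculum: bool, n_seeds: int = 20) -> float:
    """Geometric mean of the step count over several random seeds."""
    runs = [steps_to_solve(n_states, use_curriculum, random.Random(seed))
            for seed in range(n_seeds)]
    return geometric_mean(runs)


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, mean_steps(n, use_curriculum=False), mean_steps(n, use_curriculum=True))
```

With the curriculum, each stage contains only one state where the learner still acts at random, so the total cost is a sum of $N$ terms that each grow linearly in $N$. Starting from state 0, the learner must instead guess all $N$ actions correctly within a single episode.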

Figure 3: Number of steps required by a standard policy gradients optimizer to learn a policy that solves the blind cliff walk problem with probability. We compare starting each episode from the initial game state versus starting each episode by selecting a state from the demonstration. The reported numbers are geometric means taken over 20 different random seeds. When starting from demonstration states selected using our proposed algorithm, the run time of the optimizer scales quadratically in the problem size. When starting each episode from the initial game state, as is standard practice, the run time scales exponentially in the problem size.

4.2 Montezuma’s Revenge


We provide a single demonstration to our algorithm that we recorded by playing the game tool-assisted, using a tool that allowed us to reverse game time and correct any mistakes we made. We have open-sourced the tool we built for this purpose. The score obtained in our demonstration is , corresponding to about minutes of playing time. Although the tool was necessary for us to reach this score as novice players, it is still more than a factor of 10 below the high scores reported by expert players on this game without using external tools.


Our agent for playing Montezuma’s Revenge is parameterized by a convolutional neural network, combining standard spatial convolutions with causal convolutions in the time dimension as used in WaveNet (Van Den Oord et al., 2016). The agent receives grayscale observations of size which are first passed through a 2D convolutional layer with a kernel size of , stride of , and channel output. Subsequently we apply 3D convolutional layers where the kernel size in the spatial dimension is , , and . Each of these layers has a kernel size of 2 in the time dimension, with stride increasing exponentially as for layer . The number of channels is doubled at every layer. The resulting network is significantly larger than what is commonly used for reinforcement learning in the Atari environment, which we found to be necessary to learn to beat our lengthy demonstration.

We choose the reset point adjustment threshold used in Algorithm 2 as , such that we move the episode starting point back in time if at least of the rollout workers achieve returns comparable to the provided demonstration. We use PPO (Schulman et al., 2017b) to train the agent’s policy, and distribute learning over 128 GPUs with 8 workers each, for a total of 1,024 rollout workers. The agent was trained for about billion frames.


Our trained agent achieves a final score of 74,500 over approximately minutes of play, a double-speed recording of which is available online. We observe that although much of the agent’s game mirrors our demonstration, the agent surpasses the demonstration score of by picking up more diamonds along the way. In addition, the agent makes use of a feature of the game that was unknown to us: at minute 4:25 of the video sufficient time has passed for a key to re-appear, allowing the agent to proceed in a different way from the demonstration.

Table 1 compares our obtained score to results previously reported in the literature. Unfortunately, there is no standard way of evaluating performance in this setting: the game is deterministic, and different methods add different amounts of noise during action selection. Our result was achieved by sampling from a trained policy with low entropy. The amount of noise added is thus small, but comparable to the previous best results such as those by Pohlen et al. (2018) and Aytar et al. (2018).

Approach                                                    Score
Count-based exploration (Ostrovski et al. (2017))           3,705.5
Unifying count-based exploration (Bellemare et al. (2016))  6,600
DQfD (Hester et al. (2017))                                 4,739.6
Ape-X DQfD (Pohlen et al. (2018))                           29,384
Playing by watching YouTube (Aytar et al. (2018))           41,098
Ours                                                        74,500
Table 1: Score comparison on Montezuma’s Revenge

As is often the case in reinforcement learning, we find that our trained neural net policy does not yet generalize at the level of a human player. One method to test for generalization ability proposed by Machado et al. (2017) is to perturb the policy by making actions sticky and repeating the last action with probability of 0.25 at every frame. Using this evaluation method our trained policy obtains a score of 10,000 on Montezuma’s Revenge on average. Alternatively, we can take random actions with probability (repeated for frameskipped steps), which leads to an average score of for our policy. Anecdotally, we find that such perturbations also significantly reduce the score of human players on Montezuma’s Revenge, but to a lesser extent. As far as we are aware, our results using perturbed policies are still better than all those published previously. Perturbing the learned policy by starting with between 0 and 30 random no-ops did not significantly hurt results, with the majority of rollouts achieving at least the final score obtained in our demonstration.
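The sticky-action perturbation can be sketched as a thin wrapper around a policy. This is a simplified illustration, not the evaluation code used for our reported scores: the class name `StickyActions` and its interface are hypothetical, and in practice the repetition is usually applied inside the emulator at every frame rather than around the policy.

```python
import random
from typing import Callable, Optional


class StickyActions:
    """Wrap a policy so the previous action is repeated with some probability."""

    def __init__(self, policy: Callable[[object], int], repeat_prob: float = 0.25,
                 seed: Optional[int] = None) -> None:
        self.policy = policy
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)
        self.last_action: Optional[int] = None

    def reset(self) -> None:
        """Forget the previous action at the start of a new episode."""
        self.last_action = None

    def act(self, observation: object) -> int:
        """Return the repeated previous action or a fresh one from the policy."""
        if self.last_action is not None and self.rng.random() < self.repeat_prob:
            return self.last_action
        self.last_action = self.policy(observation)
        return self.last_action


if __name__ == "__main__":
    # Hypothetical policy that simply echoes an integer observation.
    sticky = StickyActions(policy=lambda obs: int(obs), repeat_prob=0.25, seed=0)
    print([sticky.act(obs) for obs in range(10)])
```

A policy that has memorized one exact action sequence degrades quickly under this perturbation, which is why it serves as a test of generalization.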

Training our agent to the reported result required 128 GPUs over a period of 2 weeks, which unfortunately made it impossible to quantify the consistency of our algorithms across runs. We leave a more systematic study of the reliability of the algorithm for future work.

Remaining Challenges

Although the step-by-step learning done by our agent is much simpler than learning to play from scratch, it is still far from trivial. One challenge our RL agent faces is that it is generally unable to reach the exact state from later on in a demonstration when it starts from an earlier state. This is partly because the agent plays the game at a different frameskip from what we used for recording the demonstration, and partly because the randomness in the actions makes it very unlikely to exactly reproduce any specific sequence of actions. The agent will thus need to be able to generalize between states that are very similar, but not identical. We found that this works well for Montezuma’s Revenge, but much less well for some other Atari games we tried, like Gravitar and Pitfall. One reason for this may be that these latter games require solving a harder vision problem: we found these games difficult to play from a downsampled screen ourselves, and we saw some improvement when using larger and deeper neural network policies.

Another challenge we encountered is that standard RL algorithms like policy gradients require striking a careful balance between exploration and exploitation: if the agent’s actions are too random, it makes too many mistakes to ever achieve the required final score when starting from the beginning of the game; if the actions are too deterministic, the agent stops learning because it does not explore alternative actions. Achieving the reported result on Montezuma’s Revenge thus required careful tuning of the coefficient of the entropy bonus used in PPO, in combination with other hyperparameters such as the learning rate and the scaling of rewards. For some other games like Gravitar and Pitfall we were unable to find hyperparameters that worked for training the full curriculum. We hope that future advances in RL will yield algorithms that are more robust to random noise and to the choice of hyperparameters.
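The role of the entropy bonus can be illustrated with a short sketch. It computes the entropy of a categorical action distribution and adds it, scaled by a coefficient, to a stand-in policy objective. The function names and the coefficient value of 0.01 are illustrative assumptions, not the settings used for our reported result, and the sketch omits the clipped surrogate and value terms of the full PPO objective.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(policy_objective: float, probs: Sequence[float],
                          entropy_coef: float = 0.01) -> float:
    """Policy objective plus an entropy bonus that rewards randomness."""
    return policy_objective + entropy_coef * entropy(probs)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    peaked = [0.97, 0.01, 0.01, 0.01]
    print(entropy(uniform), entropy(peaked))
    print(regularized_objective(1.0, uniform), regularized_objective(1.0, peaked))
```

A larger coefficient keeps the policy closer to uniform and so preserves exploration, at the cost of more mistakes over a long episode. A smaller coefficient lets the policy become nearly deterministic, at which point it stops trying alternative actions.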

5 Conclusion

Prior work on learning from demonstrations to solve difficult reinforcement learning tasks has focused mainly on imitation, which encourages identical behavior to that seen in the demonstration. In contrast, we propose a new method that optimizes returns directly. Our method breaks down a difficult exploration problem into a curriculum of subtasks, created by resetting from demonstration states. Our agent does not mimic the demonstrated behavior exactly and is able to find new and exciting solutions that the human demonstrator may not have considered, resulting in a higher score on Montezuma’s Revenge than obtained using previously published approaches.