1 Introduction
Model-free reinforcement learning is learning by trial and error. Methods such as policy gradients (Williams (1992); Sutton et al. (2000); Kakade (2002)) and Q-learning (Watkins and Dayan (1992); Mnih et al. (2015)) explore an environment by taking random actions. If, by chance, the random actions lead to success, formulated as achieving a reward, they are reinforced
and the agent becomes more likely to take these beneficial actions again in the future. This works well if rewards are frequent enough for random actions to lead to a reward with reasonable probability. Unfortunately, many of the tasks we would like to solve don’t have such dense rewards, instead requiring long sequences of very specific actions to achieve any success. Such sequences are extremely unlikely to occur randomly.
Consider a task where it takes a precise sequence of $N$ actions to achieve the first reward. If each of those actions is taken with a fixed probability $\epsilon$, a random agent will need to explore this environment for a duration that scales as $\epsilon^{-N}$ before it can expect to experience the first reward. A well-known example of such a task is the Atari game Montezuma’s Revenge, where the goal is to navigate a series of chambers to collect keys, diamonds, and other items, while evading various opponents and traps. In this game, the probability of achieving the first reward of the game, the key in the first room, can be decomposed as

$$P(\text{get key}) = P(\text{get down ladder 1}) \cdot P(\text{get down rope}) \cdot P(\text{get down ladder 2}) \cdot P(\text{jump over skull}) \cdot P(\text{get up ladder 3}).$$
By multiplying all of these probabilities together, we end up with a resulting probability that is exponentially smaller than any of the individual input probabilities. As a result, taking uniformly random actions in Montezuma’s Revenge only produces a reward about once in every half a million steps. This exponential scaling severely limits the tasks that current RL techniques can solve.
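To make this concrete, the short calculation below multiplies a handful of hypothetical sub-event probabilities and converts the product into an expected number of attempts before the first success; the individual values are purely illustrative and not measured from the game.

```python
# Purely illustrative: hypothetical per-subtask success probabilities under a
# uniformly random policy (not measured from Montezuma's Revenge). They are
# chosen so the product lands near the "once in every half a million steps"
# figure quoted in the text.
sub_event_probs = [0.07, 0.07, 0.07, 0.07, 0.07]

p_first_reward = 1.0
for p in sub_event_probs:
    p_first_reward *= p

print(f"P(first reward) ~= {p_first_reward:.2e}")                          # ~1.7e-06
print(f"expected attempts before first success ~= {1 / p_first_reward:,.0f}")  # ~600,000
```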
Prior works (Dearden et al. (1999); Kolter and Ng (2009); Tang et al. (2017); Ostrovski et al. (2017); Chen et al. (2017); Pathak et al. (2017); O’Donoghue et al. (2016); Schulman et al. (2017a); Haarnoja et al. (2017); Nachum et al. (2017)) have proposed various methods of overcoming the exploration problems of naive model-free RL. However, these methods have so far produced only limited gains on hard exploration tasks such as Montezuma’s Revenge. In this work, we instead consider using a single demonstration of successful completion of the task to aid exploration in reinforcement learning. Most previous work on learning from demonstrations has focused on imitation learning, where the agent is trained to imitate the demonstration. A downside of this approach is that many different demonstrations are required for the agent to learn to generalize, and that these demonstrations need to be of high quality to avoid learning a sub-optimal solution. Here, we instead show that it is feasible to learn to solve sparse reward problems like Montezuma’s Revenge purely by RL, bypassing the exploration problem by starting each episode from a carefully selected state from the demonstration.
2 Method
Although model-free RL methods have difficulty finding long sequences of actions, they work well for shorter sequences. The main insight behind our proposed method is that we can make a task easier to solve by decomposing it into a curriculum of subtasks requiring short action sequences. We construct this curriculum by starting each RL episode from a demonstration state. We implement this idea in a distributed setting, using parallel rollout workers that collect data by acting in an environment according to a shared RNN policy $\pi_\theta$, after which the data is fed to a centralized optimizer to learn an improvement of the policy.
Given a previously recorded demonstration $\{(\tilde{s}_t, \tilde{a}_t, \tilde{r}_t)\}_{t=0}^{T}$, our approach works by letting each parallel rollout worker (Algorithm 1) start its episode from a state $\tilde{s}_{\tau^*}$ in the demonstration. Early on in training, all workers start from states at times $\tau^*$ near the end of the demonstration at time $T$. These reset points are then gradually moved back in time as training proceeds. The data produced by the rollout workers is fed to a central optimizer (Algorithm 2) that updates the policy $\pi_\theta$ using an off-the-shelf RL method such as PPO (Schulman et al. (2017b)), A3C (Mnih et al. (2016)), or Impala (Espeholt et al. (2018)). In addition, the central optimizer calculates the proportion of rollouts that beat or at least tie the score of the demonstrator on the corresponding part of the game. If this proportion is higher than a threshold $\rho$, we move the central reset point $\tau$ backward in the demonstration.
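The reset-point schedule on the optimizer side can be sketched in a few lines of Python. The threshold `rho`, the step size `shift`, and the `policy.update` interface below are illustrative placeholders, not the exact quantities used in Algorithm 2.

```python
def central_optimizer_step(policy, rollouts, tau, rho=0.2, shift=50):
    """One iteration of the central optimizer (a sketch in the spirit of Algorithm 2).

    rollouts: list of (transitions, grad_mask, beat_demo) tuples gathered from
    the parallel workers. Returns the possibly-updated central reset point tau.
    """
    # Improve the policy with an off-the-shelf RL method (e.g. PPO) on the
    # collected data; masked warm-up steps carry no gradient.
    policy.update([(transitions, grad_mask) for transitions, grad_mask, _ in rollouts])

    # If enough rollouts beat or tied the demonstrator on the remaining part of
    # the game, move the central reset point backward in the demonstration.
    successes = sum(1 for _, _, beat_demo in rollouts if beat_demo)
    if rollouts and successes / len(rollouts) >= rho:
        tau = max(0, tau - shift)
    return tau
```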
Within each iteration of Algorithm 1, the rollout workers obtain the latest policy $\pi_\theta$ and the central reset point $\tau$ from the optimizer. Each worker then samples a local starting point $\tau^*$ from a small set of time steps $\{\tau - D, \ldots, \tau\}$ to increase diversity. In each episode, we first initialize the agent’s RNN policy’s hidden states by taking actions based on the demonstration segment directly preceding the local starting point $\tau^*$, after which the agent takes actions based on the current policy $\pi_\theta$. The demonstration segment used for initializing the RNN states is masked out in the data used for training, so that it does not contribute to the gradient used in the policy update. At the end of each episode, we increment a success counter if the current episode achieved a high score compared to the score of the demonstration; this counter is then used to decrease the central starting point $\tau$ at the right speed.
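A rollout worker can be sketched analogously, complementing the optimizer sketch above. The emulator and policy interfaces (`env.restore`, `env.step`, `policy.step`), the window size `D`, and the warm-up length are assumptions made for the sake of the example; only the overall structure mirrors Algorithm 1.

```python
import random
from collections import namedtuple

# One recorded demonstration step: emulator snapshot, the demonstrated action,
# and the demonstrator's cumulative return so far (an illustrative container,
# not the paper's actual data format).
DemoStep = namedtuple("DemoStep", ["snapshot", "action", "return_so_far"])


def run_episode(env, policy, demo, tau, D=16, warmup=8, max_steps=4096):
    """Roll out one episode starting near the central reset point tau."""
    # Sample a local starting point from a small window behind tau for diversity.
    tau_star = random.randint(max(0, tau - D), tau)
    start = max(0, tau_star - warmup)

    # Restore the emulator to a demonstration state shortly before tau_star.
    obs = env.restore(demo[start].snapshot)
    hidden = policy.initial_state()

    transitions, grad_mask, episode_return = [], [], 0.0
    for t in range(start, start + max_steps):
        warm = t < tau_star
        action, hidden = policy.step(obs, hidden)
        if warm:
            # Warm-up: replay the demonstrated action so the RNN hidden state
            # matches the local starting point; these steps are masked out.
            action = demo[t].action
        obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, done))
        grad_mask.append(not warm)
        episode_return += 0.0 if warm else reward
        if done:
            break

    # Did this episode beat or tie the demonstrator on the rest of the game?
    beat_demo = episode_return >= demo[-1].return_so_far - demo[tau_star].return_so_far
    return transitions, grad_mask, beat_demo
```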
Training proceeds until the central reset point has reached the beginning of the game, i.e. $\tau = 0$, so that the agent is succeeding at the game without using the demonstration at all. At this point we have an RL-trained agent beating or tying the human expert demonstration on the entire game.
By slowly moving the starting state from the end of the demonstration to the beginning, we ensure that at every point the agent faces an easy exploration problem where it is likely to succeed, since it has already learned to solve most of the remaining game. We can interpret solving the RL problem in this way as a form of dynamic programming (Bagnell et al., 2004). If a specific sequence of $N$ actions is required to reach a reward, this sequence may now be learned in a time that is quadratic in $N$, rather than exponential in $N$. Figure 2 demonstrates this intuition in Montezuma’s Revenge.
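One way to see the quadratic scaling is a rough counting argument (an informal bound, not a formal result): if learning what to do at the reset point that is $k$ steps before the reward takes on the order of $k$ environment steps, because each attempt from that point plays out the already-learned remainder of the game, then the total effort over the whole curriculum behaves as

$$\sum_{k=1}^{N} c\,k \;=\; c\,\frac{N(N+1)}{2} \;=\; O(N^2)$$

for some constant $c$, instead of the $\epsilon^{-N}$ scaling of naive random exploration.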

3 Related work
3.1 State resetting
Starting episodes by resetting from demonstration states was previously proposed by Hosu and Rebedea (2016) for learning difficult Atari games. However, their method did not construct a curriculum that gradually moves the starting state back from the end of the demonstration to the beginning. We found such a curriculum to be vitally important for deriving any benefit from the demonstration.
The idea of constructing a learning curriculum by starting each RL episode from a sequence of increasingly more difficult starting points was used recently by Florensa et al. (2017) for an application in robotics. Rather than selecting states from a demonstration, the authors construct the curriculum by iteratively perturbing a set of starting states using random actions and then selecting the resulting states with the right level of difficulty.
After the release of an early version of this work, Resnick et al. (2018) published concurrent work exploring the construction of a curriculum from demonstration states, showing promising results in both single-agent and multi-agent tasks. Their curriculum also starts with states near the end of the game and moves gradually to the beginning; in contrast to our dynamically learned curriculum, theirs is predefined for each task. We also explored predefined curricula in our experiments, but found the dynamic adjustment of the starting state to be crucial for achieving good results on difficult tasks like Montezuma’s Revenge.
3.2 Imitation Learning
Another relevant research direction aims to solve the exploration problem via imitation of a human expert. One example is the work by Peng et al. (2018) which uses imitation learning to mimic demonstrated movements for use in physics-based animation. Another example is the work by Nair et al. (2018) that combines demonstration-based imitation learning and reinforcement learning to overcome exploration problems for robotic tasks.
Recently, several researchers successfully demonstrated an agent learning Montezuma’s Revenge by imitation learning from a demonstration. Aytar et al. (2018) train an agent to achieve the same states seen in a YouTube video of Montezuma’s Revenge, while Pohlen et al. (2018) combine a sophisticated version of Q-learning with maximizing the likelihood of actions taken in a demonstration. Garmulewicz et al. (2018) proposed the expert-augmented actor-critic method, which combines the ACKTR policy gradient optimizer (Wu et al. (2017)) with an extra loss term based on supervision from expert demonstrations, also obtaining strong results on Montezuma’s Revenge.
The advantage of approaches based on imitation is that they do not generally require as much control over the environment as our technique does: they do not reset the environment to states other than the starting state of the game, and they do not presume access to the full game states encountered in the demonstration. Our method differs by directly optimizing what we care about, the game score, rather than making the agent imitate the demonstration. Our method can thus learn from a single demonstration without overfitting, is more robust to potentially sub-optimal demonstrations, and could offer benefits in multi-agent settings where we want to optimize performance against opponents other than the ones seen in the demonstration.
4 Experiments
We test our method on two environments. The first is the blind cliff walk environment of Schaul et al. (2015), where we demonstrate that our method reduces the exponential exploration complexity of conventional RL methods to quadratic complexity. The second is the notoriously hard Atari game Montezuma’s Revenge, where we achieve a higher score than any previously published result. We discuss results, implementation details, and remaining challenges.
4.1 Blind cliff walk
To gain insight into our proposed algorithm we start with the blind cliff walk environment proposed by Schaul et al. (2015). This is a simple RL toy problem where the goal is for the agent to blindly navigate a one-dimensional cliff. The agent starts in state 0 and has two available actions. One of these actions will take it to the next state, while the other action will make it fall off the cliff, at which point it needs to start over. Only when the end of the cliff is reached (the last of the $N$ states) does the agent receive a reward. We assume there is no way for the agent to generalize across the different states, so that the agent has to learn a tabular policy.

As explained by Schaul et al. (2015), naive application of RL to this problem suffers a run time that scales exponentially in the problem size $N$. The reason is that it takes the agent on the order of $2^N$ random steps to achieve a reward from which it can learn. Fortunately, we can do better when we are given a demonstration of an agent successfully completing the cliff walk. We start by having the agent begin each episode at state $N-1$, the second-to-last state from the demonstration. This means it only needs to take a single correct action to receive a reward, so learning is instant. After learning has been successful from state $N-1$, we can start our episodes at state $N-2$, and so on. This is expected to give a total run time that scales quadratically in the problem size $N$, as the agent needs on the order of $N$ steps to learn what to do in each of the $N$ states. This is a huge improvement over the exponential scaling of the naive RL implementation. Figure 3 shows that this advantage in scaling indeed holds in practice for this simple example.
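The following self-contained sketch reproduces the flavor of this experiment with a tabular softmax policy and a demonstration-reset curriculum; the REINFORCE-style update and all hyperparameters are illustrative choices, not the exact setup behind Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10                                 # problem size (number of states)
correct = rng.integers(0, 2, size=N)   # the one action per state that continues the walk;
                                       # a demonstration is exactly this sequence of actions
logits = np.zeros((N, 2))              # tabular softmax policy parameters
lr, rho, batch = 0.5, 0.9, 20          # illustrative hyperparameters


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def run_episode(start):
    """One episode from `start`, with a REINFORCE-style update on success."""
    s, visited = start, []
    while s < N:
        p = softmax(logits[s])
        a = rng.choice(2, p=p)
        visited.append((s, a))
        if a != correct[s]:
            break                       # fell off the cliff: no reward this episode
        s += 1
    reward = float(s == N)              # reward only at the end of the cliff
    for s_t, a_t in visited:
        one_hot = np.eye(2)[a_t]
        logits[s_t] += lr * reward * (one_hot - softmax(logits[s_t]))
    return reward


# Curriculum: start from the second-to-last demonstration state and move the
# reset point back one state each time the agent reliably reaches the reward.
start, total_episodes = N - 1, 0
while True:
    successes = sum(run_episode(start) for _ in range(batch))
    total_episodes += batch
    if successes / batch >= rho:
        if start == 0:
            break                       # the agent now succeeds from the true initial state
        start -= 1
print(f"solved blind cliff walk of size {N} after {total_episodes} episodes")
```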

Figure 3: Run time needed to solve the blind cliff walk as a function of problem size. We compare starting each episode from the initial game state versus starting each episode by selecting a state from the demonstration. The reported numbers are geometric means taken over 20 different random seeds. When starting from demonstration states selected using our proposed algorithm, the run time of the optimizer scales quadratically in the problem size; when starting each episode from the initial game state, as is standard practice, the run time scales exponentially.
4.2 Montezuma’s Revenge
Demonstration
We provide our algorithm with a single demonstration that we recorded by playing the game with tool assistance: we used a tool that allowed us to reverse game time and correct any mistakes we made, and we have open-sourced this tool. The score obtained in our demonstration is , corresponding to about minutes of playing time. Although the tool was necessary for us to reach this score as novice players, the score is still more than a factor of 10 below the high scores reported by expert players on this game without using external tools.
Implementation
Our agent for playing Montezuma’s Revenge is parameterized by a convolutional neural network, combining standard spatial convolutions with causal convolutions in the time dimension as used in WaveNet (Van Den Oord et al., 2016). The agent receives grayscale observations, which are first passed through a 2D convolutional layer, followed by 3D convolutional layers whose kernel size in the spatial dimension varies per layer. Each of these 3D layers has a kernel size of 2 in the time dimension, with the stride in the time dimension increasing exponentially with layer depth. The number of channels is doubled at every layer. The resulting network is significantly larger than what is commonly used for reinforcement learning in the Atari environment, which we found to be necessary to learn to beat our lengthy demonstration.

We choose the reset point adjustment threshold $\rho$ used in Algorithm 2 such that we move the episode starting point back in time when a sufficient proportion of the rollout workers achieve returns comparable to the provided demonstration. We use PPO (Schulman et al., 2017b) to train the agent’s policy, and distribute learning over 128 GPUs with 8 workers each, for a total of 1024 rollout workers. The agent was trained for about billion frames.
Result
Our trained agent achieves a final score of 74,500 over approximately minutes of play, a double-speed recording of which is available at https://tinyurl.com/ybbheo86. We observe that although much of the agent’s game mirrors our demonstration, the agent surpasses the score of the demonstration by picking up more diamonds along the way. In addition, the agent makes use of a feature of the game that was unknown to us: at minute 4:25 of the video, sufficient time has passed for a key to re-appear, allowing the agent to proceed in a different way from the demonstration.
Table 1 compares our obtained score to results previously reported in the literature. Unfortunately, there is no standard way of evaluating performance in this setting: the game is deterministic, and different methods add different amounts of noise during action selection. Our result was achieved by sampling from a trained policy with low entropy. The amount of noise added is thus small, but comparable to the previous best results such as those by Pohlen et al. (2018) and Aytar et al. (2018).
Table 1: Montezuma’s Revenge scores previously reported in the literature, compared to our result.

| Approach | Score |
|---|---|
| Count-based exploration (Ostrovski et al. (2017)) | 3,705.5 |
| Unifying count-based exploration (Bellemare et al. (2016)) | 6,600 |
| DQfD (Hester et al. (2017)) | 4,739.6 |
| Ape-X DQfD (Pohlen et al. (2018)) | 29,384 |
| Playing by watching YouTube (Aytar et al. (2018)) | 41,098 |
| Ours | 74,500 |
As is often the case in reinforcement learning, we find that our trained neural net policy does not yet generalize at the level of a human player. One method to test for generalization ability, proposed by Machado et al. (2017), is to perturb the policy by making actions sticky, repeating the last action with probability 0.25 at every frame. Using this evaluation method, our trained policy obtains an average score of 10,000 on Montezuma’s Revenge. Alternatively, we can take random actions with a small probability (repeated for frameskipped steps), which also leads to a reduced average score for our policy. Anecdotally, we find that such perturbations also significantly reduce the score of human players on Montezuma’s Revenge, but to a lesser extent. As far as we are aware, our results using perturbed policies are still better than all those published previously. Perturbing the learned policy by starting with between 0 and 30 random no-ops did not significantly hurt results, with the majority of rollouts achieving at least the final score obtained in our demonstration.
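As a reference for this evaluation protocol, the sticky-action perturbation can be implemented as a small environment wrapper; the sketch below assumes a Gym-style `reset`/`step` interface and is not the exact evaluation harness used for the numbers above.

```python
import random


class StickyActions:
    """Sticky-action evaluation of Machado et al. (2017): with probability p,
    the previously executed action is repeated instead of the agent's choice."""

    def __init__(self, env, p=0.25):
        self.env = env
        self.p = p
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.p:
            action = self.last_action   # ignore the agent and repeat the last action
        self.last_action = action
        return self.env.step(action)
```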
Training our agent to the reported result required 128 GPUs over a period of 2 weeks, which unfortunately made it impossible to quantify the consistency of our algorithms across runs. We leave a more systematic study of the reliability of the algorithm for future work.
Remaining Challenges
Although the step-by-step learning done by our agent is much simpler than learning to play from scratch, it is still far from trivial. One challenge our RL agent faces is that it is generally unable to reach the exact states from later on in the demonstration when it starts from an earlier state. This is partly because the agent plays the game at a different frameskip from the one we used for recording the demonstration, but it is also due to randomness in the actions, which makes it very unlikely to exactly reproduce any specific sequence of actions. The agent thus needs to be able to generalize between states that are very similar, but not identical. We found that this works well for Montezuma’s Revenge, but much less well for some other Atari games we tried, like Gravitar and Pitfall. One reason for this may be that these latter games require solving a harder vision problem: we found these games difficult to play from a downsampled screen ourselves, and we saw some improvement when using larger and deeper neural network policies.
Another challenge we encountered is that standard RL algorithms like policy gradients require striking a careful balance between exploration and exploitation: if the agent’s actions are too random, it makes too many mistakes to ever achieve the required final score when starting from the beginning of the game; if the actions are too deterministic, the agent stops learning because it does not explore alternative actions. Achieving the reported result on Montezuma’s Revenge thus required careful tuning of the coefficient of the entropy bonus used in PPO, in combination with other hyperparameters such as the learning rate and the scaling of rewards. For some other games like Gravitar and Pitfall we were unable to find hyperparameters that worked for training the full curriculum. We hope that future advances in RL will yield algorithms that are more robust to random noise and to the choice of hyperparameters.
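In standard PPO (Schulman et al. (2017b)), the entropy bonus whose coefficient we tune enters the per-timestep objective as

$$L_t(\theta) = \min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big) \;-\; c_1\,L_t^{VF}(\theta) \;+\; c_2\,S[\pi_\theta](s_t),$$

with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_\text{old}}(a_t \mid s_t)$ and policy entropy $S$. A larger entropy coefficient $c_2$ keeps actions more random (more exploration, but more mistakes over a long episode), while a smaller $c_2$ risks the premature determinism described above.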
5 Conclusion
Prior work on learning from demonstrations to solve difficult reinforcement learning tasks has focused mainly on imitation, which encourages identical behavior to that seen in the demonstration. In contrast, we propose a new method that optimizes returns directly. Our method breaks down a difficult exploration problem into a curriculum of subtasks, created by resetting from demonstration states. Our agent does not mimic the demonstrated behavior exactly and is able to find new and exciting solutions that the human demonstrator may not have considered, resulting in a higher score on Montezuma’s Revenge than obtained using previously published approaches.
References
- Aytar et al. (2018) Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592, 2018.
- Bagnell et al. (2004) J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in neural information processing systems, pages 831–838, 2004.
- Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
- Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Chen et al. (2017) Richard Y Chen, John Schulman, Pieter Abbeel, and Szymon Sidor. UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.
- Dearden et al. (1999) Richard Dearden, Nir Friedman, and David Andre. Model based bayesian exploration. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 150–159. Morgan Kaufmann Publishers Inc., 1999.
- Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
- Florensa et al. (2017) Carlos Florensa, David Held, Markus Wulfmeier, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.
- Garmulewicz et al. (2018) Michał Garmulewicz, Henryk Michalewski, and Piotr Miłoś. Expert-augmented actor-critic for vizdoom and montezumas revenge. arXiv preprint arXiv:1809.03447, 2018.
- Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
- Hester et al. (2017) Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
- Hosu and Rebedea (2016) Ionel-Alexandru Hosu and Traian Rebedea. Playing atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077, 2016.
- Kakade (2002) Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
- Kolter and Ng (2009) J Zico Kolter and Andrew Y Ng. Near-bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM, 2009.
- Machado et al. (2017) Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
- Nachum et al. (2017) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2772–2782, 2017.
- Nair et al. (2018) Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE, 2018.
- O’Donoghue et al. (2016) Brendan O’Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Pgq: Combining policy gradient and q-learning. arXiv preprint arXiv:1611.01626, 2016.
- Ostrovski et al. (2017) Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.
- Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
- Peng et al. (2018) Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.
- Pohlen et al. (2018) Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018.
- Resnick et al. (2018) Cinjon Resnick, Roberta Raileanu, Sanyam Kapoor, Alex Peysakhovich, Kyunghyun Cho, and Joan Bruna. Backplay: "Man muss immer umkehren". arXiv preprint arXiv:1807.06919, 2018.
- Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
- Schulman et al. (2017a) John Schulman, Pieter Abbeel, and Xi Chen. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2017a.
- Schulman et al. (2017b) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
- Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
- Tang et al. (2017) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems (NIPS), 2017.
- Van Den Oord et al. (2016) Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In SSW, page 125, 2016.
- Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Wu et al. (2017) Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017.