1 Introduction
Efficient exploration to learn a (near-)optimal behavior in the long term is a challenging problem in reinforcement learning (RL). With sparse rewards and no expert demonstrations available, the agent must carefully balance exploration and exploitation when taking a long sequence of actions to receive infrequent nonzero rewards. Many existing methods guide exploration based on visitation counts (strehl2008analysis, ; bellemare2016unifying, ; ostrovski2017count, ; tang2017exploration, ; choi2018contingency, ) or errors in predicting dynamics (urgen1991adaptive, ; schmidhuber1991curious, ; stadie2015incentivizing, ; pathak2017curiosity, ; burda2018exploration, ; burda2018large, ) to encourage visiting novel states, where the tradeoff between exploration and exploitation is usually controlled by the weight of the exploration signal.
Self-Imitation Learning (SIL) (oh2018self, ) is a recent method that tackles the exploration-exploitation dilemma. It exploits the agent’s previous good trajectories to improve the efficiency of learning, demonstrating that exploitation can indirectly drive further exploration in certain environments. However, in environments with locally optimal solutions, exploiting the experience of accumulating deceptive rewards may mislead the agent and prevent it from reaching a higher return in the long term. For example, as illustrated in Figure 1, the agent starts in the bottom left corner, where it easily collects the apple near its initial location by random exploration and achieves a small positive reward. The SIL agent exploits the trajectory following the orange path and learns to collect the nearby rewards quickly. However, it is then less likely to collect the key, open the door, and get the treasure within the given time limit. Therefore, in order to find the optimal path (purple), it is better to exploit the past experience in diverse directions (gray paths), instead of focusing only on the trajectories with myopic and suboptimal rewards.
This paper investigates the imitation of diverse past trajectories and how it leads to further exploration and avoids getting stuck at a suboptimal behavior. Specifically, we propose to use a buffer of past trajectories to cover diverse possible directions. We then learn a trajectory-conditioned policy to imitate any trajectory from the buffer, treating it as a demonstration. After completing the demonstration, the agent performs random exploration. In this way, the exploration frontier is indirectly pushed further in diverse directions. The explored region is gradually expanded and the chance of finding a globally optimal solution increases. After finding trajectories with near-optimal (or high) total rewards, we imitate them to learn the final policy that achieves good performance.
Our main contributions are summarized as follows: (1) We propose a novel architecture for a trajectory-conditioned policy to imitate diverse demonstrations. (2) We demonstrate the importance of imitating diverse past experiences to indirectly drive exploration to different regions of the environment, by comparing it with existing approaches on various sparse-reward reinforcement learning tasks with discrete and continuous action spaces. (3) In particular, we achieve performance competitive with the state of the art on the hard-exploration Atari game Montezuma’s Revenge without using expert demonstrations or resetting to an arbitrary state.
2 Related Work
Self-Imitation
Learning a good policy by imitating past experiences has been discussed in (oh2018self, ; gangwani2018learning, ; guo2018generative, ), where the agent is trained to imitate only the high-reward trajectories with the SIL (oh2018self, ) or GAIL objective (gangwani2018learning, ). In contrast, we store the past trajectories ending with diverse states in the buffer, because trajectories with low reward in the short term could lead to high reward in the long term, and thus following a diverse set of trajectories could be beneficial for discovering optimal solutions. Furthermore, our method focuses on explicit trajectory-level imitation, while existing methods use sampled state-action pairs from the buffer to update the policy. Gangwani et al. (gangwani2018learning, ) proposed to learn multiple diverse policies in a SIL framework using the Stein Variational Policy Gradient with the Jensen-Shannon kernel. Empirically, their exploration method can be limited by the number of policies learned simultaneously and the exploration performance of each single policy, as shown in Appendix F.
Exploration
At a high level, exploration methods (urgen1991adaptive, ; thrun1992efficient, ; thrun1992active, ; auer2002using, ; chentanez2005intrinsically, ; oudeyer2007intrinsic, ; strehl2008analysis, ; oudeyer2009intrinsic, ) in RL tend to award a bonus (via intrinsic reward) to encourage an agent to visit novel states. Recently this idea was scaled up to large state spaces by utilizing approximation techniques (tang2017exploration, ), density models (bellemare2016unifying, ), inverse dynamic models to localize the agent (choi2018contingency, ), or a random network to evaluate the novelty of a state (burda2018exploration, ). We propose that instead of directly taking a quantification of novelty as an intrinsic reward signal, one can encourage exploration by rewarding the agent when it successfully follows demonstration trajectories that would lead to novel states. Go-Explore (ecoffet2019go, ) also shows the benefit of exploration by returning to a promising state to solve hard-exploration Atari games, though its success relies on the assumption that the environment is deterministic and resettable. Resetting to an arbitrary state can result in a two to three orders of magnitude reduction in sample complexity, thus giving an unfair advantage over methods that do not make use of resetting; more importantly, such resetting is often infeasible in real environments. As discussed in Appendix H, when using a perfect goal-conditioned policy instead of a direct ‘reset’ function, Go-Explore could not explore as efficiently as our method. Previous works attempted reaching a goal state by learning a set of sub-policies (liu2019learning, ) or a goal-conditioned policy in pixel observation space (dong2019explicit, ). However, these policies do not perform well on sparse-reward environments such as Montezuma’s Revenge. Our method provides an indirect ‘reset’ function in a stochastic environment by imitating a trajectory using a goal-conditioned policy.
Several studies (gregor2016variational, ; eysenbach2018diversity, ; pong2019skew, ) seek a diversity of exploration by maximizing the entropy of mixture skill policies or generated goal states. However, these methods mainly focus on learning diverse skills or goal states as a continuous latent variable, and the experiments are performed mainly on a variety of simulated robotic tasks with a relatively simple observation space.
Goal-Conditioned Policy
Many previous works (andrychowicz2017hindsight, ; nair2017combining, ; schaul2015universal, ; pathak2018zero, ) studied learning a goal-conditioned policy. Similarly to hindsight experience replay (andrychowicz2017hindsight, ), our approach samples goal states from past experiences. However, we use past experiences through both supervised learning and reinforcement learning objectives. Compared to a single goal state, the state trajectory leads the agent to follow a demonstration in a soft order to reach the goal state even when it is far away from the current state. Our method shares the same motivation as Duan et al. (duan2017one, ), which uses an attention model over the demonstration and follows the idea of the sequence-to-sequence model (sutskever2014sequence, ; cho2014learning, ). However, our architecture is simpler since it does not use an attention model over the current observation, and it is evaluated on a variety of environments, while Duan et al. (duan2017one, ) mainly focuses on the block stacking task.
Imitation Learning
The goal of imitation learning is to train a policy to mimic a given demonstration. For example, DQfD (hester2018deep, ), Ape-X DQfD (pohlen2018observe, ), TDC+CMC (aytar2018playing, ), and LfSD (salimans2018learning, ) achieve good results on the hard-exploration Atari games using human demonstrations. In contrast, our method does not rely on expert trajectories; instead, it treats the agent’s own past trajectories as demonstrations.
3 Method
The main idea of our method is to maintain a buffer of diverse trajectories collected during training and to train a trajectory-conditioned policy by leveraging reinforcement learning and supervised learning to roughly follow demonstration trajectories sampled from the trajectory buffer. Therefore, the agent is encouraged to explore beyond various visited states in the environment and gradually push its exploration frontier further. Ideally, we want to find trajectories with near-optimal total rewards. After this, we fine-tune the final policy to imitate the best trajectories found during training.^1 We name our method Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL).

^1 In our implementation, we train a trajectory-conditioned policy to imitate the best trajectories. Alternatively, an unconditional stochastic policy could also be trained to imitate them.
3.1 Background and Notation
We first briefly describe the standard reinforcement learning setting that we build our approach upon. Specifically, at each time step $t$, an agent observes a state $s_t$ and selects an action $a_t \in \mathcal{A}$, where it receives a reward $r_t$ when transitioning from a state $s_t$ to a next state $s_{t+1} \in \mathcal{S}$, where $\mathcal{S}$ is a set of all states and $\mathcal{A}$ is a set of all actions. The goal of policy-based RL algorithms is to find a policy $\pi_\theta(a_t | s_t)$ parameterized by $\theta$ that maximizes the expected discounted return $\mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\gamma \in (0, 1]$ is a discount factor.
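To make the objective concrete, the discounted return of a single finite episode can be computed as in the following minimal sketch. The function name and the default discount value are illustrative choices, not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one finite episode."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

In a sparse-reward episode such as `[0, 0, 1]`, the only nonzero reward is discounted by two steps, which is why long action sequences with infrequent rewards are hard to credit.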
In our work, we assume a state $s_t$ includes the agent’s observation $o_t$ (e.g., raw pixel image) and a high-level abstract state embedding $e_t$ (e.g., the agent’s location in the abstract space). The embedding $e_t$ may be learnable from $o_t$ (or $s_t$), but in this work, we assume that a high-level embedding is provided as a part of $s_t$. A trajectory-conditioned policy $\pi_\theta(a_t | e_{\le t}, o_t, g)$ (which we refer to as $\pi_\theta(\cdot | g)$ in shorthand notation) takes a sequence of state embeddings $g = \{e^g_0, e^g_1, \cdots, e^g_{|g|}\}$ as input for a demonstration, where $|g|$ is the length of the trajectory $g$. A sequence of the agent’s past state embeddings $e_{\le t} = \{e_0, e_1, \cdots, e_t\}$ is provided to determine which part of the demonstration has been followed. Together with the current observation $o_t$, it helps to determine the correct action $a_t$ to accurately imitate the demonstration. Our goal here is to find a set of optimal state embedding sequence(s) $g^*$ and the policy $\pi^*_\theta(\cdot | g)$ to maximize the return: $g^*, \theta^* = \arg\max_{g, \theta} \mathbb{E}_{\pi_\theta(\cdot | g)}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$. For robustness, we may want to find multiple near-optimal embedding sequences with similar returns and a trajectory-conditioned policy for executing them.
3.2 Organizing Trajectory Buffer
We maintain a trajectory buffer $\mathcal{D} = \{(e^{(1)}, \tau^{(1)}, n^{(1)}), (e^{(2)}, \tau^{(2)}, n^{(2)}), \cdots\}$ of diverse past trajectories. For each embedding-trajectory-count tuple $(e^{(i)}, \tau^{(i)}, n^{(i)})$, $\tau^{(i)}$ is the best trajectory ending with a state with the high-level representation $e^{(i)}$, and $n^{(i)}$ is the number of times this state representation $e^{(i)}$ has been visited during training. To keep the buffer compact, a high-level discrete state representation $e^{(i)}$ is used (e.g., the agent’s location in the discrete grid, discretized accumulated reward, etc.) and the existing entry is replaced if an improved trajectory is found.
When given a new episode $\{(o_0, e_0, a_0, r_0), \cdots, (o_T, e_T, a_T, r_T)\}$, all the state representations $\{e_0, \cdots, e_T\}$ in this episode are considered, because the buffer maintains all of the possible paths available for future exploration to avoid missing any possibility of finding an optimal solution. If $e_t$ is not yet stored in the buffer, $(e_t, \tau_{\le t}, 1)$ is directly pushed into the buffer, where $\tau_{\le t} = \{(o_0, e_0, a_0, r_0), \cdots, (o_t, e_t, a_t, r_t)\}$ is the agent’s current partial episode ending with $e_t$. If the partial episode $\tau_{\le t}$ is better (i.e., higher return or shorter trajectory) than the stored trajectory $\tau^{(i)}$ when reaching $e^{(i)} = e_t$, $\tau^{(i)}$ is replaced by the current trajectory $\tau_{\le t}$. The algorithm is described in Appendix A.
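The buffer update described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the algorithm of Appendix A: the names `Entry`, `is_better`, and `update_buffer` are hypothetical, embeddings are assumed to be hashable (e.g., tuples of integers), the return of a partial episode is taken to be the undiscounted sum of its rewards, and "better" is assumed to mean a higher return, with ties broken by a shorter trajectory.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    trajectory: list   # partial episode ending at this embedding
    ret: float         # undiscounted return of that partial episode (assumption)
    count: int = 1     # number of visits to this embedding during training


def is_better(new_ret, new_len, old_ret, old_len):
    # Assumed ordering: higher return first, then shorter trajectory.
    return new_ret > old_ret or (new_ret == old_ret and new_len < old_len)


def update_buffer(buffer, episode):
    """Update `buffer` (dict: embedding -> Entry) with one episode.

    `episode` is a list of (observation, embedding, action, reward) tuples.
    """
    ret = 0.0
    for t, (_, e, _, r) in enumerate(episode):
        ret += r
        partial = episode[: t + 1]
        entry = buffer.get(e)
        if entry is None:
            # First visit: push the partial episode with count 1.
            buffer[e] = Entry(partial, ret)
        else:
            entry.count += 1
            if is_better(ret, len(partial), entry.ret, len(entry.trajectory)):
                entry.trajectory, entry.ret = partial, ret
    return buffer
```

Keying the dictionary by the discrete embedding keeps one trajectory per abstract state, so the buffer grows with the number of distinct embeddings rather than with the number of episodes.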
3.3 Sampling Demonstrations
When learning the trajectory-conditioned policy $\pi_\theta$, demonstration trajectories are sampled from the trajectory buffer. We record the count $n^{(i)}$ of how many times this state embedding $e^{(i)}$ is visited, and set the sampling probability as $p^{(i)} \propto 1/\sqrt{n^{(i)}}$. This is inspired by the count-based exploration bonus (strehl2008analysis, ; bellemare2016unifying, ): we sample a trajectory that ends with a less frequently visited state because this leads the agent to reach rarely visited regions in the state space and is more promising for discovering novel states. When trajectories with satisfactory accumulated rewards have been discovered, we sample the best trajectories stored in the buffer for imitation learning.^2 These trajectories are used to train the policy to converge to a high-reward behavior. The algorithm is described in Appendix B.

^2 We assume that the optimal episode reward or the threshold of an ideal episode reward is known in advance. Alternatively, we could switch between exploration and imitation by adjusting the probability of imitation in an episode as training goes on, as discussed in Appendix I.

3.4 Learning Trajectory-Conditioned Policy
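Looking back at the sampling rule of Section 3.3, a minimal sketch may help fix ideas before the policy itself is described. It assumes that the sampling weight of a buffer entry is the inverse square root of its visit count, in the spirit of the count-based bonuses cited there; the function names and the dictionary layout are illustrative and not taken from the authors' implementation.

```python
import math
import random


def sampling_probabilities(counts):
    """Map embedding -> probability, with weight 1/sqrt(n) (assumed form)."""
    weights = {e: 1.0 / math.sqrt(n) for e, n in counts.items()}
    z = sum(weights.values())
    return {e: w / z for e, w in weights.items()}


def sample_demonstration(counts, rng=random):
    """Pick the end embedding of the next demonstration.

    `counts` maps each buffered embedding to its visit count (n >= 1).
    """
    embeddings = list(counts)
    weights = [1.0 / math.sqrt(counts[e]) for e in embeddings]
    return rng.choices(embeddings, weights=weights, k=1)[0]
```

With counts `{"a": 1, "b": 4}`, the rarely visited embedding `"a"` is drawn twice as often as `"b"`, so demonstrations are biased toward the frontier of the explored region.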
Imitation Reward
A given demonstration trajectory $g = \{e^g_0, e^g_1, \cdots, e^g_{|g|}\}$ is used to provide rewards for imitation $r^{\text{im}}_t$, similarly to the imitation learning method introduced by Aytar et al. (aytar2018playing, ). At the beginning of an episode, the index $u$ of the last visited state embedding in the demonstration is initialized as $u = -1$, meaning that no state embedding in the demonstration has been visited yet. At each step $t$, if the agent’s new state $s_{t+1}$ has an embedding $e_{t+1}$ and it is the same as any of the next $\Delta t$ state embeddings starting from the last visited state embedding $e^g_u$ in the demonstration (i.e., $e_{t+1} = e^g_{u'}$ where $u < u' \le u + \Delta t$), then it receives a positive imitation reward $r^{\text{im}}$, and the index of the last visited state embedding in the demonstration is updated as $u \leftarrow u'$. This encourages the agent to visit the state embeddings in the demonstration in a soft order. When the last state embedding in the demonstration has been visited by the agent (i.e., $u = |g|$), there is no further imitation reward and the agent performs random exploration until the episode terminates. To summarize, the agent receives a reward $r^{\text{DTSIL}}_t$ defined as
$$r^{\text{DTSIL}}_t = f(r_t) + r^{\text{im}}_t, \tag{1}$$
where $f(\cdot)$ is a monotonically increasing function (e.g., clipping (mnih2015dqn, )). Figure 3 illustrates the calculation of $r^{\text{im}}_t$ and the update of $u$ during an episode when the agent visits a state whose embedding appears in the demonstration $g$.
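The soft-order matching described above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `imitation_step` and `dtsil_reward`, the window size `delta_t`, the reward value `r_im`, and the choice of clipping to [0, 1] as the monotone function are all assumptions made for the example. When several embeddings in the window match, the sketch takes the earliest one, which is also an assumption.

```python
def imitation_step(demo, u, e_next, delta_t, r_im):
    """One step of soft-order imitation.

    demo:   list of demonstration embeddings
    u:      index of the last visited demonstration embedding (-1 if none)
    e_next: embedding of the agent's new state
    Returns (imitation_reward, updated_u).
    """
    last = len(demo) - 1
    if u >= last:
        # Demonstration completed: no further imitation reward.
        return 0.0, u
    # Look ahead at most delta_t embeddings past the last visited one.
    for k in range(u + 1, min(u + delta_t, last) + 1):
        if demo[k] == e_next:
            return r_im, k
    return 0.0, u


def dtsil_reward(r_env, r_imitation, f=lambda r: max(0.0, min(1.0, r))):
    """Combine the transformed environment reward with the imitation reward."""
    return f(r_env) + r_imitation
```

Because the window allows skipping ahead, the agent is not penalized for taking a shortcut that bypasses a few demonstration states, which is what makes the order "soft".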
Policy Architecture
For imitation learning with diverse demonstrations, we design a trajectory-conditioned policy $\pi_\theta(a_t | e_{\le t}, o_t, g)$ that should imitate any given trajectory $g$. Inspired by neural machine translation methods (sutskever2014sequence, ; cho2014learning, ; bahdanau2014neural, ; luong2015effective, ), the demonstration trajectory $g$ is the source sequence and the incomplete trajectory of the agent’s state representations $e_{\le t}$ is the target sequence. We apply a recurrent neural network and an attention mechanism to the sequence data to predict actions that would make the agent follow the demonstration trajectory. As illustrated in Figure 3, the RNN computes the hidden features $h^g_i$ for each state embedding $e^g_i$ ($0 \le i \le |g|$) in the demonstration and derives the hidden features $h_t$ for the agent’s state representation $e_t$. Then the attention weights $\alpha_t$ are computed by comparing the current agent’s hidden features $h_t$ with the demonstration’s hidden features $h^g_i$ ($0 \le i \le |g|$). The context vector $c_t$ is computed as an attention-weighted summation of the demonstration’s hidden states to capture the relevant information in the demonstration trajectory and to predict the action $a_t$.
Reinforcement Learning Objective
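With the reward $r^{\text{DTSIL}}_t$ defined in Equation 1, the trajectory-conditioned policy $\pi_\theta$ can be trained with a policy gradient algorithm (sutton2000policy, ; schulman2017proximal, ):
$$\mathcal{L}^{\text{RL}} = \mathbb{E}_{\pi_\theta}\left[-\log \pi_\theta(a_t | e_{\le t}, o_t, g)\, \hat{A}_t\right], \quad \hat{A}_t = \sum_{d=0}^{n-1} \gamma^{d} r^{\text{DTSIL}}_{t+d} + \gamma^{n} V_\theta(e_{\le t+n}, o_{t+n}, g) - V_\theta(e_{\le t}, o_t, g), \tag{2}$$
where the expectation $\mathbb{E}_{\pi_\theta}$ indicates the empirical average over a finite batch of on-policy samples and $n$ denotes the number of rollout steps taken in each iteration. We use Proximal Policy Optimization (PPO) (schulman2017proximal, ) as an actor-critic policy gradient algorithm for our experiments.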
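As a rough illustration of this kind of objective, the sketch below computes bootstrapped advantage estimates over one rollout and the resulting plain policy-gradient loss. It is deliberately not PPO: the clipped surrogate used in the experiments is omitted, and the function names, the list-based inputs, and the choice to bootstrap from the value estimate at the end of the rollout are assumptions made for the example.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage estimates for one rollout of n steps.

    rewards[t], values[t]: reward and value estimate at step t (t = 0..n-1)
    bootstrap_value:       value estimate of the state after the last step
    """
    advantages = [0.0] * len(rewards)
    ret = bootstrap_value
    # Work backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        advantages[t] = ret - values[t]
    return advantages


def policy_gradient_loss(log_probs, advantages):
    """Empirical average of -log pi(a_t | ...) * A_t over the batch."""
    return -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(log_probs)
```

Here `rewards` would hold the combined environment and imitation reward at each step, so that following the demonstration raises the advantage of the actions that did so.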
Supervised Learning Objective
To improve trajectory-conditioned imitation learning and to better leverage the past trajectories, we propose a supervised learning objective. By sampling a trajectory $\tau$ from the buffer $\mathcal{D}$, the demonstration trajectory (i.e., source sequence) is formulated as $g = \{e_0, e_1, \cdots, e_{|g|}\}$, and we assume that the agent’s incomplete trajectory (i.e., target sequence) is the partial trajectory $g_{\le t} = e_{\le t} = \{e_0, e_1, \cdots, e_t\}$ for any $1 \le t \le |g|$. Then $a_t$ is the ‘correct’ action at step $t$ for the agent to imitate the demonstration. Our supervised learning objective is to maximize the log probability of taking such actions:
$$\mathcal{L}^{\text{SL}} = -\log \pi_\theta(a_t | e_{\le t}, o_t, g), \quad \text{where } g = \{e_0, e_1, \cdots, e_{|g|}\}. \tag{3}$$
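The supervised objective amounts to a negative log-likelihood over the actions stored in a buffered trajectory. The sketch below assumes a discrete action space and that the policy exposes per-step action logits; the function name and input layout are hypothetical.

```python
import math


def supervised_loss(action_logits, demo_actions):
    """Mean negative log-likelihood of the demonstrated actions.

    action_logits[t]: list of logits over actions at step t, produced by the
                      policy given the partial trajectory and the demonstration
    demo_actions[t]:  index of the action taken at step t in the buffer
    """
    total = 0.0
    for logits, a in zip(action_logits, demo_actions):
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += -(logits[a] - log_z)
    return total / len(demo_actions)
```

Since the trajectory serves both as the demonstration and as the source of target actions, no additional environment interaction is needed to compute this loss.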
4 Experiments
In the experiments, we aim to answer the following questions: (1) How well does the trajectory-conditioned policy imitate the diverse demonstration trajectories? (2) Does imitation of the past diverse experience enable the agent to further explore more diverse directions and guide the exploration to find the trajectory with a near-optimal total reward? (3) Can our proposed method aid in avoiding myopic behaviors and converge to near-optimal solutions?
4.1 Implementation Details
Our algorithm begins with an empty buffer $\mathcal{D}$, and we initialize the demonstration as a list of zero vectors. With such an input demonstration, the agent performs random exploration to collect trajectories to fill the buffer $\mathcal{D}$. In practice, the sampled demonstration trajectory $g = \{e^g_0, e^g_1, \cdots, e^g_{|g|}\}$ could be lengthy. We therefore present only a part of the demonstration as the input to the policy at a time, similar to translating a paragraph sentence by sentence. More specifically, we first input $\{e^g_0, e^g_1, \cdots, e^g_m\}$ ($m \le |g|$) into the policy. When the index $u$ of the agent’s last visited state embedding in the demonstration belongs to $\{m - \Delta t, \cdots, m\}$, we consider that the agent has accomplished this part of the demonstration, and therefore switch to the next part $\{e^g_u, e^g_{u+1}, \cdots, e^g_{u+m}\}$. We repeat this process until the last part of the demonstration. If the last part is less than $m + 1$ steps long, we pad the sequence with zero vectors.
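The windowing of a long demonstration can be sketched as below. The names `demo_part` and `next_start`, the part length parameter `m`, and the look-ahead window `delta_t` are illustrative, and the rule that the next part begins at the last visited index is an assumption made for the example.

```python
def demo_part(demo, start, m, pad):
    """Return the m + 1 embeddings starting at `start`, padded at the end."""
    part = list(demo[start:start + m + 1])
    return part + [pad] * (m + 1 - len(part))


def next_start(start, u, m, delta_t):
    """Start index of the part to feed to the policy next.

    The current part covers indices start..start + m. Once the last visited
    index u is within delta_t of the end of the part, the next part begins
    at u; otherwise the current part is kept.
    """
    end = start + m
    return u if end - delta_t <= u <= end else start
```

For a five-step demonstration and `m = 2`, `demo_part(demo, 3, 2, pad)` returns the last two embeddings followed by one padding vector.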
A reward function is used on the Mujoco domain, and is used on other environments. We use
as a reward to encourage the imitation. Further details about the hyperparameters and the environment setup are described in the Appendix. Our PPO is based on OpenAI’s implementation (baselines, ).
We compare our method with the following baselines: (1) PPO: Proximal Policy Optimization (schulman2017proximal, ); (2) PPO+EXP: PPO with a count-based exploration bonus computed from $N(e)$, where $N(e)$ is the number of times the discrete state representation $e$ was visited during training; (3) PPO+SIL: PPO with Self-Imitation Learning (oh2018self, ).
4.2 Key-Door-Treasure Domain
The Key-Door-Treasure domain (shown in Figure 1) is a simple grid-world environment with deceptive rewards that can lead the agent to a local optimum. An observation consists of the agent’s location $(x_t, y_t)$ and binary variables showing whether the agent has gotten the key, opened the door, or collected the treasure. A state is represented as the agent’s location and the cumulative reward: $e_t = (x_t, y_t, \sum_{i=1}^{t} r_i)$, indicating the location of the agent and identifying the collected objects.
As shown in Figure 3(a), both PPO and PPO+SIL agents are stuck with the suboptimal policy of collecting the first two apples (Figure 1). The PPO+EXP agent explores further and gathers the two apples and one key. Our method learns to collect objects on the right side of the maze and achieves the highest total reward of 8 within the time limit. Figure 3(b) and Figure 3(c) show the highest episode reward and the number of different state embeddings found during training. The PPO+EXP agent occasionally scores an episode reward of 5 because its initial location is not fixed, and from certain starting locations it can, with some luck, collect three apples, pick up the key, and open the door within the time limit. However, when the agent explores to find the reward signals on the right side of the maze, it never finds a path to the treasure.
In Figure 3(d), we show the average success ratio of the imitation during training. It is defined as follows: for a given demonstration $g = \{e^g_0, e^g_1, \cdots, e^g_{|g|}\}$, let $u$ be the index of the last visited state embedding in $g$; then the success ratio of imitating $g$ is $u / |g|$ (i.e., the portion of the trajectory imitated). Ideally, we want the success ratio to be 1.0, which indicates that the trajectory-conditioned policy could successfully follow any given demonstration from the buffer. At 3M steps, when the trajectories with the optimal total reward 8 are found, we start to sample the best trajectories stored in the buffer as demonstrations. Our trajectory-conditioned policy can imitate them well, with a success ratio around 1.0.
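The success ratio can be computed as in the short sketch below. It assumes zero-based indexing of the demonstration, so that the last index equals the number of embeddings minus one, and it treats an episode in which no demonstration embedding was visited (index -1) as a ratio of 0; both conventions and the function name are assumptions made for the example.

```python
def success_ratio(u, demo_len):
    """Portion of a demonstration that was imitated.

    u:        index of the last visited demonstration embedding (-1 if none)
    demo_len: number of embeddings in the demonstration (>= 1)
    """
    if demo_len == 1:
        return 1.0 if u >= 0 else 0.0
    return max(u, 0) / (demo_len - 1)
```

Averaging this quantity over the demonstrations sampled in a training interval gives a curve of the kind reported in the figure.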
Figure 5 visualizes a learning process. PPO+SIL fails on this task because the agent quickly exploits a good experience of collecting the apples but the buffer is filled with the trajectories exploring the nearby region. On the contrary, our method maintains a buffer of diverse trajectories which are used as demonstrations to guide the agent to explore different regions and discover an optimal behavior.
4.3 Toy Montezuma’s Revenge Domain
We evaluate our method on a more challenging domain, Toy Montezuma’s Revenge (roderick2018deep, ), which requires a more sophisticated strategy to explore the environment. As shown in Figure 7, there are 24 rooms similar to the layout of the first level of Atari Montezuma’s Revenge, with a discrete grid for each room. The agent should navigate the labyrinth to locate the keys, unlock the doors and reach the goal (the treasure room). The observation is represented by the agent’s location and cumulative episode reward. The state representation is the same as the observation.
The learning curve of the averaged episode reward in Figure 7 shows that PPO, PPO+SIL, and PPO+EXP could not learn a policy to reach the goal. The PPO+EXP agent occasionally finds a trajectory with the total reward of 11,200 reaching the treasure room, but fails to exploit this experience. On the other hand, our method learns a good behavior of not only reaching the goal room, but also collecting all of the keys to achieve an optimal total reward of 11,600.
4.4 Atari Montezuma’s Revenge
We evaluate our method on the hard-exploration game Montezuma’s Revenge of the Arcade Learning Environment (ALE) (bellemare2013arcade, ; machado2017revisiting, ), which consists of multiple levels and 24 rooms in each level (map shown in Figure 11). The observation is a frame of raw pixel images, and we use the state representation consisting of the agent’s ground truth location (obtained from RAM) and the number of keys it holds. We use the number of keys to reduce the size of the state embedding space. With such a discrete state representation, we can also add a count-based exploration bonus to the reward in Equation 1, in order to expedite exploration. This variant is denoted as ‘DTSIL+EXP’.
As shown in Figure 8, in the early stage, the average episode reward of DTSIL+EXP is worse than PPO+EXP because our policy is trained to imitate diverse demonstrations instead of directly maximizing the environment reward. Compared to PPO+EXP, our agent is not eager to blindly follow high-reward paths, since a path with a relatively low score in the short term might lead to higher rewards in the long term. As training continues, our method successfully discovers trajectories to accomplish the first level with a satisfactory total reward of more than 20,000. After switching to imitating the best trajectories in the buffer by sampling them as demonstrations, the average episode reward dramatically increases to over 25,000.^3 Table 1 compares our proposed method with previous works without using any expert demonstration or resetting to an arbitrary state, where our approach significantly outperforms the other approaches.

^3 Demo videos of the learned policies for both PPO+EXP and DTSIL+EXP are available at: https://sites.google.com/view/diversesil/home.

Table 1: Scores on Montezuma’s Revenge for methods that use neither expert demonstrations nor resetting to an arbitrary state.
Method          Score
DTSIL           29,278
AbstractHRL     11,500
A2C+CoEX+RAM    6,600
SmartHash       5,661
DeepCS          3,500
A2C+SIL         2,500
PPO+CoEX        11,618
RND             10,070
4.5 Mujoco
We evaluate our proposed method on continuous control tasks. We adapt the maze environment introduced in (duan2016benchmarking, ) to construct a set of challenging tasks, which require the point-mass agent to collect the key, open the door with the same color, and finally reach the treasure to get a high score. A key cannot be used again once it has opened a door of the same color, which makes it easy for the agent to become trapped. A visualization of these environments is shown in Figure 9.
The observation is the agent’s location and range sensor reading about nearby objects. The state representation is . We discretize the continuous variables as integers when determining whether a state embedding has a corresponding trajectory stored in the buffer and whether a state embedding in the demonstration has been visited by the agent.
In the first maze of Figure 9, the agent can easily get the blue key near its initial location and open the blue door in the upper part. However, the optimal path is to bring the key to open the blue door in the bottom and obtain the treasure, reaching an episode reward of 9. In the second maze, the agent should bring the blue key and pick up the green key while avoiding opening the blue door in the upper part. Then, the green and blue key can open the two doors at the bottom of the maze, which results in the total reward of 12. The learning curves in Figure 9 show that PPO, PPO+EXP, and PPO+SIL may get stuck at a suboptimal behavior, whereas our policy eventually converges to the behavior achieving the optimal episode reward.
5 Conclusion
This paper proposes to learn diverse policies by imitating diverse trajectory-level demonstrations through count-based exploration over these trajectories. Imitation of diverse past trajectories can guide the agent to rarely visited states and encourages further exploration of novel states. We demonstrate that on a variety of environments with local optima, our method significantly improves self-imitation learning (SIL). It avoids prematurely converging to a suboptimal solution and learns a near-optimal behavior in order to achieve a high total reward.
References
 [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.

[2]
P. Auer.
Using confidence bounds for exploitationexploration tradeoffs.
Journal of Machine Learning Research
, 3(Nov):397–422, 2002.  [3] Y. Aytar, T. Pfaff, D. Budden, T. Paine, Z. Wang, and N. de Freitas. Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pages 2930–2941, 2018.
 [4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [5] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying countbased exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.

[6]
M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling.
The arcade learning environment: An evaluation platform for general
agents.
Journal of Artificial Intelligence Research
, 47:253–279, 2013.  [7] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Largescale study of curiositydriven learning. arXiv preprint arXiv:1808.04355, 2018.
 [8] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
 [9] N. Chentanez, A. G. Barto, and S. P. Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 1281–1288, 2005.
 [10] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [11] J. Choi, Y. Guo, M. Moczulski, J. Oh, N. Wu, M. Norouzi, and H. Lee. Contingencyaware exploration in reinforcement learning. arXiv preprint arXiv:1811.01483, 2018.
 [12] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
 [13] H. Dong, J. Mao, X. Cui, and L. Li. Explicit recall for efficient exploration, 2019.
 [14] Y. Duan, M. Andrychowicz, B. Stadie, O. J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. Oneshot imitation learning. In Advances in neural information processing systems, pages 1087–1098, 2017.
 [15] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
 [16] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. Goexplore: a new approach for hardexploration problems. arXiv preprint arXiv:1901.10995, 2019.
 [17] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
 [18] T. Gangwani, Q. Liu, and J. Peng. Learning selfimitating diverse policies. arXiv preprint arXiv:1805.10309, 2018.
 [19] K. Gregor, D. J. Rezende, and D. Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
 [20] Y. Guo, J. Oh, S. Singh, and H. Lee. Generative adversarial selfimitation learning. arXiv preprint arXiv:1812.00950, 2018.
 [21] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep qlearning from demonstrations. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [22] E. Z. Liu, R. Keramati, S. Seshadri, K. Guu, P. Pasupat, E. Brunskill, and P. Liang. Learning abstract models for longhorizon exploration, 2019.
 [23] M.T. Luong, H. Pham, and C. D. Manning. Effective approaches to attentionbased neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
 [24] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2017.
 [25] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Humanlevel control through deep reinforcement learning. Nature, 2015.
 [26] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining selfsupervised learning and imitation for visionbased rope manipulation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2146–2153. IEEE, 2017.
 [27] J. Oh, Y. Guo, S. Singh, and H. Lee. Selfimitation learning. arXiv preprint arXiv:1806.05635, 2018.
 [28] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Countbased exploration with neural density models. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 2721–2730. JMLR. org, 2017.
 [29] P.Y. Oudeyer and F. Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2009.

[30]
P.Y. Oudeyer, F. Kaplan, and V. V. Hafner.
Intrinsic motivation systems for autonomous mental development.
IEEE transactions on evolutionary computation
, 11(2):265–286, 2007. 
[31]
D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell.
Curiositydriven exploration by selfsupervised prediction.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops
, pages 16–17, 2017.  [32] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zeroshot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2050–2053, 2018.
 [33] T. Pohlen, B. Piot, T. Hester, M. G. Azar, D. Horgan, D. Budden, G. Barth-Maron, H. van Hasselt, J. Quan, M. Večerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018.
 [34] V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
 [35] M. Roderick, C. Grimm, and S. Tellex. Deep abstract Q-networks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 131–138. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
 [36] T. Salimans and R. Chen. Learning montezuma’s revenge from a single demonstration. arXiv preprint arXiv:1812.03381, 2018.
 [37] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320, 2015.
 [38] J. Schmidhuber. Curious model-building control systems. In [Proceedings] 1991 IEEE International Joint Conference on Neural Networks, pages 1458–1463. IEEE, 1991.
 [39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [40] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
 [41] C. Stanton and J. Clune. Deep curiosity search: Intra-life exploration improves performance on challenging deep reinforcement learning problems. arXiv preprint arXiv:1806.00553, 2018.

 [42] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
 [43] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 [44] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [45] H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
 [46] S. B. Thrun. Efficient exploration in reinforcement learning. 1992.
 [47] S. B. Thrun and K. Möller. Active exploration in dynamic environments. In Advances in neural information processing systems, pages 531–538, 1992.
 [48] J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical report, Citeseer, 1991.
Appendix A Algorithm of Trajectory Buffer Update
Appendix B Algorithm of Sampling Demonstrations
Appendix C Hyperparameters
The hyperparameters for our proposed method used in each experiment are listed in Table 2.
Environment  Key-Door-Treasure  Toy Montezuma’s Revenge  Atari Montezuma’s Revenge  Mujoco
[...]  2.5e-4  2.5e-4  2.5e-4  1e-4
[...]  2  4  8  8
[...]  10  10  10  10
[...]  [...]  [...]  [...]  0
[...]  8  11600  20000  9 or 12
Appendix D Environment Setting
On Atari Montezuma’s Revenge, the number of keys the agent holds at each step is approximated from the rewards received, because the agent gets one key with reward 100 and loses one key when opening a door with reward 300. More details of the environment setting are summarized in Table 3.
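As a rough illustration of this reasoning, the following Python sketch infers a key count from the reward stream alone. The event rule (a reward of 100 adds a key, a reward of 300 removes one) follows the description above; the function name `approximate_key_count` and the clamping at zero are assumptions made for this example and are not the exact expression used in our implementation.

```python
def approximate_key_count(rewards):
    """Estimate how many keys the agent holds after each step.

    Assumption for this sketch: a reward of exactly 100 means a key was
    picked up, and a reward of exactly 300 means a door was opened. Any
    other event that happens to give the same reward would be miscounted,
    which is why the result is only an approximation.
    """
    keys = 0
    counts = []
    for r in rewards:
        if r == 100:
            keys += 1
        elif r == 300:
            keys = max(keys - 1, 0)  # clamp: never report a negative count
        counts.append(keys)
    return counts


# Example: pick up two keys, open one door, then receive no further reward.
print(approximate_key_count([0, 100, 0, 100, 300, 0]))
```

The estimate depends only on per-step rewards, so it can be maintained online without access to the emulator's internal state.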
Environment  Key-Door-Treasure  Toy Montezuma’s Revenge  Atari Montezuma’s Revenge  Mujoco
Observation
Action
Reward
Time limit
Stochasticity
Appendix E Details of Network Architecture and Training Process
In the trajectory-conditioned policy, we first embed each input state with a fully-connected layer with 64 units. Next, an RNN with gated recurrent units (GRU) with 128 units computes the hidden features. The attention weight is calculated based on the Bahdanau attention mechanism [4]. The concatenation of the context vector, the hidden feature of the agent’s current state, and convolutional features from the observation is used to predict the policy output. For experiments on the Key-Door-Treasure domain, Toy Montezuma’s Revenge, and Mujoco, the features from the observation are not required for the policy. However, on the Atari game of Montezuma’s Revenge, it is necessary to take the raw observation as input to the policy because the location information in the state representation alone could not guide the agent to use the temporal context (e.g., avoiding moving skulls and passing laser gates).
During the imitation phase in our implementation, we sample the top 1% best trajectories (with the highest total reward and least number of steps) in the buffer as demonstrations, and train our trajectory-conditioned policy to imitate them and achieve good performance.
Appendix F Comparison with Learning Diverse Policies by SVPG
While the code for the Stein variational policy gradient (SVPG) in [18] has not yet been released, we replicate the method in [18] to learn diverse policies. Their experiments focus on continuous control tasks with relatively simple observation spaces and a limited number of locally optimal branches in the state space. We learn 8 diverse policies in parallel following their method on our Key-Door-Treasure domain with a discrete action space. Figure 10 shows a visualization of the learning progress: the 8 policies learn to cover different regions of the environment. The method explores better than PPO+SIL, but the exploration of each individual agent is not strong enough to find the optimal path to achieve the highest episode reward.
Appendix G Map of Atari Montezuma’s Revenge at the First Level
A map of Atari Montezuma’s Revenge at the first level is shown in Figure 11. It is challenging to bring two keys to open the two doors in room 17 behind the treasure in room 15, where the agent can pass to the next level.
Appendix H Comparison with Go-Explore
To evaluate the efficiency of exploration, we compare our method with the “exploration phase” of the Go-Explore algorithm [16]. The idea behind Go-Explore is to reset the agent to any interesting state sampled from the buffer of state embeddings, and then explore further using random actions. For a fair comparison with our method, we modify the Go-Explore code so that it cannot reset to arbitrary states in the environment. We also use the same state representation and the same sampling weight to sample goal states from the buffer.
Without the direct ‘reset’ function, and assuming a perfect goal-conditioned policy that can visit any state sampled from the buffer, the Go-Explore agent can precisely advance to the goal state by following the stored trajectory. The total number of steps taken in the environment is counted by summing the number of steps taken to follow the stored trajectories and the number of steps taken to explore.
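The step accounting described above can be written as a short sketch. The function name `total_environment_steps` and the representation of each iteration as a pair of step counts are assumptions made for this illustration.

```python
def total_environment_steps(iterations):
    """Sum the steps charged to a reset-free Go-Explore-style run.

    Each item in `iterations` is a pair
    (stored_trajectory_length, exploration_steps): the steps needed to
    replay the stored trajectory up to the sampled goal state, plus the
    steps spent exploring from that state.
    """
    total = 0
    for follow_steps, explore_steps in iterations:
        total += follow_steps + explore_steps
    return total


# Example: three iterations, each exploring for 100 steps after replaying
# stored trajectories of length 120, 300, and 0 (goal is the start state).
print(total_environment_steps([(120, 100), (300, 100), (0, 100)]))  # 720
```

Charging the replay steps makes the comparison fair, because an agent that cannot reset must spend real environment steps to return to a stored state.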
In Figure 12, we show the average number of rooms found and the number of different state representations found during training. Even if we assume that there is a perfect goal-conditioned policy in Go-Explore to guide the agent to follow the stored trajectory exactly and visit the goal state, the learning curves demonstrate that our method is more efficient at exploring diverse state representations and consequently visits several rooms. This is because our method uses the count-based exploration bonus to encourage exploration around and beyond the stored trajectories, and the imitation reward allows the agent to follow the demonstrations in a soft-order.
Appendix I Automatic Switching between Exploration and Imitation
As stated in Section 3, imitation is switched on when the agent has discovered trajectories with a satisfactory total reward. However, instead of setting the threshold for a satisfactory total reward by hand, we can switch between exploration and imitation automatically. In the early stage of training, the agent prefers the exploration mode because there are few good trajectories to imitate. As training progresses, it is reasonable to train the policy to follow the best trajectories stored in the buffer. The probability of using the imitation mode is set from the number of steps the agent has taken in the environment so far and the total number of training steps (e.g., 800M steps for the experiments on Montezuma’s Revenge). Therefore, the probability of imitating good trajectories increases as training continues, and the agent switches between exploration and imitation automatically, without the need to set a threshold for a satisfactory trajectory reward.
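One way to realize such a schedule is sketched below. The linear ratio of steps taken to total steps is an assumption chosen for illustration, since it is the simplest schedule that increases with training progress; the function names `imitation_probability` and `choose_mode` are likewise hypothetical.

```python
import random


def imitation_probability(steps_taken, total_steps):
    """Probability of choosing the imitation mode.

    ASSUMPTION: a linear schedule steps_taken / total_steps, clipped to
    [0, 1]. Any schedule that grows with training progress would fit the
    description in the text.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(steps_taken / total_steps, 0.0), 1.0)


def choose_mode(steps_taken, total_steps, rng=random):
    """Return "imitation" or "exploration" for the next episode."""
    if rng.random() < imitation_probability(steps_taken, total_steps):
        return "imitation"
    return "exploration"


# Example with an 800M-step budget: halfway through training, imitation is
# chosen with probability 0.5 under the assumed linear schedule.
print(imitation_probability(400_000_000, 800_000_000))  # 0.5
```

With this rule the agent always explores at the start of training and always imitates at the end, with no reward threshold to tune.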
As shown in Figure 13, by automatically switching between exploration and imitation, our method reaches an average episode reward of around 30,000 over 3 runs within 800M environment steps.