Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy

07/24/2019 · by Yijie Guo et al.

This paper proposes a method for learning a trajectory-conditioned policy to imitate diverse demonstrations from the agent's own past experiences. We demonstrate that such self-imitation drives exploration in diverse directions and increases the chance of finding a globally optimal solution in reinforcement learning problems, especially when the reward is sparse and deceptive. Our method significantly outperforms existing self-imitation learning and count-based exploration methods on various sparse-reward reinforcement learning tasks with local optima. In particular, we report a state-of-the-art score of more than 25,000 points on Montezuma's Revenge without using expert demonstrations or resetting to arbitrary states.


1 Introduction

Efficient exploration to learn a (near-)optimal behavior in the long term is a challenging problem in reinforcement learning (RL). With sparse rewards and no expert demonstrations available, the agent must carefully balance exploration and exploitation while taking a long sequence of actions to receive infrequent non-zero rewards. Many existing methods provide guidance for exploration based on visitation counts (strehl2008analysis; bellemare2016unifying; ostrovski2017count; tang2017exploration; choi2018contingency) or errors in predicting dynamics (urgen1991adaptive; schmidhuber1991curious; stadie2015incentivizing; pathak2017curiosity; burda2018exploration; burda2018large) to encourage visiting novel states, where the trade-off between exploration and exploitation is usually controlled by the weight of the exploration signal.

Figure 1: Map of the Key-Door-Treasure domain, where the reward for getting an apple, picking up a key, opening a door, and getting a treasure is 1, 1, 1, and 5, respectively. The time limit for one episode is 35 steps.

Self-Imitation Learning (SIL) (oh2018self) is a recent method that tackles this exploration-exploitation dilemma. It exploits the agent's previous good trajectories to improve the efficiency of learning, demonstrating that exploitation can indirectly drive further exploration in certain environments. However, in environments with locally optimal solutions, exploiting the experience of accumulating deceptive rewards may mislead the agent and prevent it from reaching a higher return in the long term. For example, as illustrated in Figure 1, the agent starts in the bottom left corner, where it easily collects the apple near its initial location by random exploration and achieves a small positive reward. The SIL agent exploits the trajectory following the orange path and learns to collect the nearby rewards quickly. However, it is then less likely to collect the key, open the door, and get the treasure within the given time limit. Therefore, in order to find the optimal path (purple), it is better to exploit past experience in diverse directions (gray paths) instead of focusing only on the trajectories with myopic and sub-optimal rewards.

This paper investigates how imitating diverse past trajectories leads to further exploration and avoids getting stuck in a sub-optimal behavior. Specifically, we propose to use a buffer of past trajectories to cover diverse possible directions, and we learn a trajectory-conditioned policy that imitates any trajectory from the buffer, treating it as a demonstration. After completing the demonstration, the agent performs random exploration. In this way, the exploration frontier is indirectly pushed further in diverse directions: the explored region gradually expands, and the chance of finding a globally optimal solution increases. After finding trajectories with near-optimal (or high) total rewards, we imitate them to learn a final policy that achieves good performance.

Our main contributions are summarized as follows: (1) We propose a novel architecture for a trajectory-conditioned policy to imitate diverse demonstrations. (2) We demonstrate the importance of imitating diverse past experiences to indirectly drive exploration to different regions of the environment, by comparing with existing approaches on various sparse-reward reinforcement learning tasks with discrete and continuous action spaces. (3) In particular, we achieve performance competitive with the state of the art on the hard-exploration Atari game Montezuma's Revenge without using expert demonstrations or resetting to an arbitrary state.

2 Related Work

Self-Imitation

Learning a good policy by imitating past experiences has been discussed in (oh2018self; gangwani2018learning; guo2018generative), where the agent is trained to imitate only the high-reward trajectories with the SIL (oh2018self) or GAIL objective (gangwani2018learning). In contrast, we store past trajectories ending with diverse states in the buffer, because trajectories with low reward in the short term could lead to high reward in the long term, and thus following a diverse set of trajectories could be beneficial for discovering optimal solutions. Furthermore, our method focuses on explicit trajectory-level imitation, while existing methods use sampled state-action pairs from the buffer to update the policy. Gangwani et al. (gangwani2018learning) proposed to learn multiple diverse policies in a SIL framework using the Stein Variational Policy Gradient with the Jensen-Shannon kernel. Empirically, their exploration can be limited by the number of policies learned simultaneously and by the exploration performance of each single policy, as shown in Appendix F.

Exploration

At a high level, exploration methods (urgen1991adaptive; thrun1992efficient; thrun1992active; auer2002using; chentanez2005intrinsically; oudeyer2007intrinsic; strehl2008analysis; oudeyer2009intrinsic) in RL tend to award a bonus (via an intrinsic reward) to encourage an agent to visit novel states. Recently, this idea has been scaled up to large state spaces by using approximation techniques (tang2017exploration), density models (bellemare2016unifying), inverse dynamics models that localize the agent (choi2018contingency), or a random network to evaluate the novelty of a state (burda2018exploration). We propose that, instead of directly taking a quantification of novelty as an intrinsic reward signal, one can encourage exploration by rewarding the agent when it successfully follows demonstration trajectories that lead to novel states. Go-Explore (ecoffet2019go) also shows the benefit of exploration by returning to a promising state to solve hard-exploration Atari games, though its success relies on the assumption that the environment is deterministic and resettable. Resetting to an arbitrary state can result in a two to three orders of magnitude reduction in sample complexity, giving an unfair advantage over methods that do not make use of resetting; more importantly, such resetting is often infeasible in real environments. As discussed in Appendix H, when using a perfect goal-conditioned policy instead of a direct ‘reset’ function, Go-Explore cannot explore as efficiently as our method. Previous works attempted to reach a goal state by learning a set of sub-policies (liu2019learning) or a goal-conditioned policy in pixel observation space (dong2019explicit); however, these policies do not perform well on sparse-reward environments such as Montezuma’s Revenge. Our method provides an indirect ‘reset’ function in a stochastic environment by imitating a trajectory with a goal-conditioned policy. Several studies (gregor2016variational; eysenbach2018diversity; pong2019skew) seek diversity of exploration by maximizing the entropy of mixture skill policies or generated goal states. However, these methods mainly focus on learning diverse skills or goal states as a continuous latent variable, and the experiments are performed mainly on simulated robotic tasks with relatively simple observation spaces.

Goal-Conditioned Policy

Many previous works (andrychowicz2017hindsight; nair2017combining; schaul2015universal; pathak2018zero) studied learning a goal-conditioned policy. Similarly to hindsight experience replay (andrychowicz2017hindsight), our approach samples goal states from past experiences. However, we use past experiences through both supervised learning and reinforcement learning objectives. Compared to a single goal state, a state trajectory leads the agent to follow the demonstration in a soft order and to reach a goal state even far away from the current state. Our method shares the same motivation as Duan et al. (duan2017one), which uses an attention model over the demonstration and follows the idea of the sequence-to-sequence model (sutskever2014sequence; cho2014learning). However, our architecture is simpler since it does not use an attention model over the current observation, and it is evaluated on a variety of environments, while Duan et al. (duan2017one) focus mainly on the block stacking task.

Imitation Learning

The goal of imitation learning is to train a policy to mimic a given demonstration. For example, DQfD (hester2018deep), Ape-X DQfD (pohlen2018observe), TDC+CMC (aytar2018playing), and LfSD (salimans2018learning) achieve good results on the hard-exploration Atari games using human demonstrations. In contrast, our method does not rely on expert trajectories; instead, it treats the agent’s own past trajectories as demonstrations.

  Initialize the parameters θ of the trajectory-conditioned policy π_θ
  Initialize the trajectory buffer D ← ∅ # Store diverse past trajectories
  Initialize the set of transitions in the current episode E ← ∅ # Store current episode trajectory
  Initialize the set of on-policy samples ← ∅ # Store data for on-policy PPO update
  Initialize the demonstration trajectory g
  for each iteration do
     for each step t do
        Observe s_t = (o_t, e_t) and choose an action a_t ~ π_θ(a_t | e_≤t, o_t, g)
        Execute the action a_t in the environment to get r_t and s_{t+1}
        Store the transition (s_t, a_t, r_t) in E
        # Positive reward if agent follows demonstration
        # No reward after agent completes g and then takes random exploration
        Determine the reward r_t by comparing e_{t+1} with g (Eq. 1)
        Store the on-policy sample (o_t, e_≤t, g, a_t, r_t)
     end for
     if the current state is terminal then
        D ← UpdateBuffer(D, E) (Alg. 2)
        Clear the current episode trajectory E ← ∅
        g ← SampleDemo(D) (Alg. 3)
     end if
     # Perform PPO update using on-policy samples and supervised learning update
     Update θ (Eq. 2, Eq. 3)
     Clear the on-policy samples
  end for
Algorithm 1 Diverse Self-Imitation Learning with Trajectory-Conditioned Policy

3 Method

The main idea of our method is to maintain a buffer of diverse trajectories collected during training and to train a trajectory-conditioned policy, leveraging reinforcement learning and supervised learning, to roughly follow demonstration trajectories sampled from the trajectory buffer. The agent is thus encouraged to explore beyond the various states it has visited and to gradually push its exploration frontier further. Ideally, we want to find trajectories with near-optimal total rewards. After this, we fine-tune the final policy to imitate the best trajectories found during training. (In our implementation, we train a trajectory-conditioned policy to imitate the best trajectories; alternatively, an unconditional stochastic policy could also be trained to imitate them.) We name our method Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL).

3.1 Background and Notation

We first briefly describe the standard reinforcement learning setting that we build our approach upon. Specifically, at each time step $t$, an agent observes a state $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}$, receiving a reward $r_t$ when transitioning from the state $s_t$ to the next state $s_{t+1}$, where $\mathcal{S}$ is the set of all states and $\mathcal{A}$ is the set of all actions. The goal of policy-based RL algorithms is to find a policy $\pi_\theta$ parameterized by $\theta$ that maximizes the expected discounted return $\mathbb{E}_{\pi_\theta}[\sum_t \gamma^t r_t]$, where $\gamma \in [0, 1)$ is a discount factor.

In our work, we assume a state $s_t$ includes the agent's observation $o_t$ (e.g., a raw pixel image) and a high-level abstract state embedding $e_t$ (e.g., the agent's location in the abstract space). The embedding may be learnable from $o_t$ (or $s_t$), but in this work, we assume that a high-level embedding is provided as a part of $s_t$. A trajectory-conditioned policy $\pi_\theta(a_t \mid e_{\le t}, o_t, g)$ (which we refer to as $\pi_\theta$ in shorthand notation) takes a sequence of state embeddings $g = \{e^g_1, e^g_2, \ldots, e^g_{|g|}\}$ as input for a demonstration, where $|g|$ is the length of the trajectory $g$. A sequence of the agent's past state embeddings $e_{\le t}$ is provided to determine which part of the demonstration has been followed. Together with the current observation $o_t$, it helps to determine the correct action $a_t$ to accurately imitate the demonstration. Our goal is to find a set of optimal state embedding sequence(s) and the policy $\pi_\theta$ to maximize the return. For robustness, we may want to find multiple near-optimal embedding sequences with similar returns and a trajectory-conditioned policy for executing them.
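Concretely, one way to write this objective with the notation above is the following (a hedged rendering; the exact form of the original equation may differ):

```latex
% Hedged rendering of the objective described above: find embedding sequence(s) g
% and policy parameters \theta that maximize the expected discounted return.
\begin{equation*}
  g^{*}, \theta^{*}
  \;=\; \arg\max_{g,\,\theta}\;
  \mathbb{E}_{\pi_{\theta}(\cdot \mid g)}\Big[\textstyle\sum_{t} \gamma^{t} r_{t}\Big]
\end{equation*}
```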

3.2 Organizing Trajectory Buffer

We maintain a trajectory buffer $\mathcal{D} = \{(e^{(1)}, \tau^{(1)}, n^{(1)}), (e^{(2)}, \tau^{(2)}, n^{(2)}), \ldots\}$ of diverse past trajectories. For each embedding-trajectory-count tuple $(e^{(i)}, \tau^{(i)}, n^{(i)})$, $\tau^{(i)}$ is the best trajectory ending with a state whose high-level representation is $e^{(i)}$, and $n^{(i)}$ is the number of times this state representation has been visited during training. In order to maintain a compact buffer, a high-level discrete state representation is used (e.g., the agent's location in a discrete grid, the discretized accumulated reward, etc.), and an existing entry is replaced if an improved trajectory is found.

When given a new episode $E$, all the state representations $e_t$ in this episode are considered, because the buffer maintains all of the possible paths available for future exploration so as not to miss any possibility of finding an optimal solution. If $e_t$ is not yet stored in the buffer, the tuple $(e_t, E_{\le t}, 1)$ is pushed into the buffer, where $E_{\le t}$ is the agent's partial episode ending with $e_t$. If the partial episode $E_{\le t}$ is better (i.e., higher return or shorter trajectory) than the stored trajectory $\tau$ for $e_t$, then $\tau$ is replaced by $E_{\le t}$. The algorithm is described in Appendix A.
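To make the buffer organization concrete, below is a minimal Python sketch of the update just described. The class name `TrajectoryBuffer`, the dictionary layout, and the tuple fields are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the trajectory-buffer update described above.
# Assumptions: state embeddings are hashable tuples; an episode is a list of
# (embedding, action, reward) transitions. All names here are illustrative.

class TrajectoryBuffer:
    def __init__(self):
        # embedding -> (best partial trajectory, total reward, length, visit count)
        self.entries = {}

    def update(self, episode):
        partial, partial_return = [], 0.0
        for (embedding, action, reward) in episode:
            partial = partial + [(embedding, action, reward)]
            partial_return += reward
            if embedding not in self.entries:
                # First time this embedding is reached: store the partial episode.
                self.entries[embedding] = (list(partial), partial_return, len(partial), 1)
            else:
                traj, ret, length, count = self.entries[embedding]
                # Replace if the new partial episode has a higher return,
                # or the same return in fewer steps.
                if partial_return > ret or (partial_return == ret and len(partial) < length):
                    traj, ret, length = list(partial), partial_return, len(partial)
                self.entries[embedding] = (traj, ret, length, count + 1)
```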

3.3 Sampling Demonstrations

When learning the trajectory-conditioned policy $\pi_\theta$, demonstration trajectories are sampled from the trajectory buffer $\mathcal{D}$. For each entry, we record the count $n$ of how many times its state embedding has been visited, and we set the sampling probability to be higher for less frequently visited embeddings. This is inspired by the count-based exploration bonus (strehl2008analysis; bellemare2016unifying): we sample a trajectory that ends with a less frequently visited state because this leads the agent to reach rarely visited regions in the state space and is more promising for discovering novel states. When trajectories with satisfactory accumulated rewards have been discovered, we sample the best trajectories stored in the buffer for imitation learning. (We assume that the optimal episode reward or the threshold of an ideal episode reward is known in advance. Alternatively, we could switch between exploration and imitation by adjusting the probability of imitation in an episode as training goes on, as discussed in Appendix I.) These trajectories are used to train the policy to converge to a high-reward behavior. The algorithm is described in Appendix B.
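A matching sketch of the sampling step, reusing the illustrative `TrajectoryBuffer` above, is shown below. The exact sampling weight is not reproduced here; a weight proportional to $1/\sqrt{n}$ is assumed in the spirit of the count-based bonus just cited, and the top-fraction cutoff for the imitation phase is a placeholder.

```python
import numpy as np

def sample_demonstrations(buffer, num_demos, best_mode=False, top_fraction=0.01):
    """Sketch of demonstration sampling. The weights below (1/sqrt(count) and the
    top-1% cutoff) are illustrative assumptions, not the paper's exact values."""
    entries = list(buffer.entries.values())  # (trajectory, return, length, count)
    if best_mode:
        # Imitation phase: sample among the highest-return (and shortest) trajectories.
        entries.sort(key=lambda e: (-e[1], e[2]))
        top = entries[:max(1, int(len(entries) * top_fraction))]
        idx = np.random.randint(len(top), size=num_demos)
        return [[emb for (emb, _, _) in top[i][0]] for i in idx]
    # Exploration phase: favor trajectories ending with rarely visited states.
    weights = np.array([1.0 / np.sqrt(count) for (_, _, _, count) in entries])
    probs = weights / weights.sum()
    idx = np.random.choice(len(entries), size=num_demos, p=probs)
    return [[emb for (emb, _, _) in entries[i][0]] for i in idx]
```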

Figure 2: An example showing the calculation of the imitation reward based on the demonstration $g$ and the agent's state embeddings. At each step $t$, we check the state embedding $e_{t+1}$ to determine the reward according to Equation 1. After completing the demonstration, the agent performs random exploration (imitation reward 0).
Figure 3: Architecture of the trajectory-conditioned policy.

3.4 Learning Trajectory-Conditioned Policy

Imitation Reward

A given demonstration trajectory $g = \{e^g_1, e^g_2, \ldots, e^g_{|g|}\}$ is used to provide rewards for imitation, similarly to the imitation learning method introduced by aytar2018playing. At the beginning of an episode, the index $u$ of the last visited state embedding in the demonstration is initialized to 0. At each step $t$, if the agent's new state $s_{t+1}$ has an embedding $e_{t+1}$ that matches one of the state embeddings after the last visited one in the demonstration (i.e., $e_{t+1} = e^g_{u'}$ for some $u' > u$), then the agent receives a positive imitation reward, and the index of the last visited state embedding is updated to $u \leftarrow u'$. This encourages the agent to visit the state embeddings in the demonstration in a soft order. When the last state embedding in the demonstration has been visited (i.e., $u = |g|$), there is no further imitation reward, and the agent performs random exploration until the episode terminates. To summarize, the agent receives a reward $r_t$ defined as

(1)

where $f$ is a monotonically increasing function (e.g., reward clipping (mnih2015dqn)). Figure 2 illustrates the calculation of $r_t$ and the update of $u$ during an episode in which the agent visits states whose embeddings appear in the demonstration $g$.
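The soft-order matching that generates the imitation reward can be sketched as follows. The reward value `r_im`, the optional look-ahead `window`, and the function name are placeholders; the total reward $r_t$ in Equation 1 presumably combines this imitation term with the environment reward transformed by $f$.

```python
def imitation_reward(new_embedding, demo, u, r_im=0.1, window=None):
    """Sketch of the soft-order imitation reward described above.
    demo: list of demonstration state embeddings e^g_1..e^g_|g|.
    u: index of the last visited demonstration embedding (0 before any match).
    r_im and window are illustrative placeholders, not the paper's exact values."""
    if u >= len(demo):
        return 0.0, u  # demonstration completed: no further imitation reward
    end = len(demo) if window is None else min(len(demo), u + window)
    for j in range(u, end):           # look at embeddings after the last visited one
        if new_embedding == demo[j]:
            return r_im, j + 1        # soft-order match: reward and advance the index
    return 0.0, u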

Policy Architecture

For imitation learning with diverse demonstrations, we design a trajectory-conditioned policy $\pi_\theta(a_t \mid e_{\le t}, o_t, g)$ that should imitate any given trajectory $g$. Inspired by neural machine translation methods (sutskever2014sequence; cho2014learning; bahdanau2014neural; luong2015effective), the demonstration trajectory is the source sequence and the incomplete trajectory of the agent's state representations is the target sequence. We apply a recurrent neural network and an attention mechanism to the sequence data to predict actions that would make the agent follow the demonstration trajectory. As illustrated in Figure 3, an RNN computes the hidden features $h^g_i$ for each state embedding $e^g_i$ in the demonstration and derives the hidden features $h_t$ for the agent's state representation. Then the attention weights are computed by comparing the current agent's hidden features $h_t$ with the demonstration's hidden features $h^g_i$. The context vector $c_t$ is computed as an attention-weighted summation of the demonstration's hidden features to capture the relevant information in the demonstration trajectory and to predict the action $a_t$.
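A minimal PyTorch sketch of this architecture is given below. The layer sizes (64-unit embedding, 128-unit GRU) follow Appendix E, while the module name, the two-stream layout, and the action/value heads are illustrative assumptions rather than the exact released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryConditionedPolicy(nn.Module):
    """Sketch of the architecture in Figure 3: GRU encoders over the demonstration
    and the agent's own embedding sequence, Bahdanau-style attention over the
    demonstration features, and action/value heads. Illustrative only."""

    def __init__(self, emb_dim, num_actions, hidden=128):
        super().__init__()
        self.embed = nn.Linear(emb_dim, 64)            # fully-connected embedding of e
        self.demo_rnn = nn.GRU(64, hidden, batch_first=True)
        self.agent_rnn = nn.GRU(64, hidden, batch_first=True)
        self.attn_demo = nn.Linear(hidden, hidden)     # Bahdanau-style attention terms
        self.attn_agent = nn.Linear(hidden, hidden)
        self.attn_v = nn.Linear(hidden, 1)
        self.pi = nn.Linear(2 * hidden, num_actions)   # context + current feature -> logits
        self.vf = nn.Linear(2 * hidden, 1)

    def forward(self, demo_emb, agent_emb):
        # demo_emb: (B, |g|, emb_dim); agent_emb: (B, t, emb_dim)
        h_demo, _ = self.demo_rnn(F.relu(self.embed(demo_emb)))     # (B, |g|, H)
        h_agent, _ = self.agent_rnn(F.relu(self.embed(agent_emb)))  # (B, t, H)
        h_t = h_agent[:, -1]                                        # current hidden feature
        # Attention weights over the demonstration steps.
        scores = self.attn_v(torch.tanh(
            self.attn_demo(h_demo) + self.attn_agent(h_t).unsqueeze(1)))  # (B, |g|, 1)
        alpha = torch.softmax(scores, dim=1)
        context = (alpha * h_demo).sum(dim=1)                       # attention-weighted sum
        feat = torch.cat([context, h_t], dim=-1)
        # For Atari, convolutional features of the raw observation would also be
        # concatenated here (see Appendix E); omitted in this sketch.
        return self.pi(feat), self.vf(feat)                         # action logits, value
```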

Reinforcement Learning Objective

With the reward $r_t$ defined in Equation 1, the trajectory-conditioned policy can be trained with a policy gradient algorithm (sutton2000policy; schulman2017proximal):

(2)

where the expectation indicates the empirical average over a finite batch of on-policy samples collected during the rollout steps taken in each iteration. We use Proximal Policy Optimization (PPO) (schulman2017proximal) as the actor-critic policy gradient algorithm in our experiments.
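Written with the notation above, a standard actor-critic form of this objective is the following; the advantage estimate $\hat{A}_t$ and the empirical expectation $\hat{\mathbb{E}}_t$ are our notation, and the exact form of Equation 2 may differ.

```latex
% Hedged reconstruction of a standard policy-gradient objective for the
% trajectory-conditioned policy; \hat{A}_t is an advantage estimate and
% \hat{\mathbb{E}}_t the empirical average over the on-policy batch.
\begin{equation*}
  \mathcal{L}^{RL}
  \;=\; -\,\hat{\mathbb{E}}_{t}\!\left[\log \pi_{\theta}\!\left(a_{t} \mid e_{\le t}, o_{t}, g\right) \hat{A}_{t}\right]
\end{equation*}
```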

Supervised Learning Objective

To improve trajectory-conditioned imitation learning and to better leverage the past trajectories, we propose a supervised learning objective. We sample a trajectory $\tau$ from the buffer $\mathcal{D}$, formulate the demonstration (i.e., source sequence) as its full sequence of state embeddings, and treat the agent's incomplete trajectory (i.e., target sequence) as the partial trajectory $e_{\le t}$ for any step $t$. Then $a_t$ is the ‘correct’ action at step $t$ for the agent to imitate the demonstration. Our supervised learning objective is to maximize the log probability of taking such actions:

(3)
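With the notation above, a plausible form of this objective, written as a loss to be minimized, is:

```latex
% Hedged reconstruction of the supervised objective: maximize the log-likelihood of
% the demonstrated action at each step (written here as a loss to be minimized).
\begin{equation*}
  \mathcal{L}^{SL}
  \;=\; -\sum_{t} \log \pi_{\theta}\!\left(a_{t} \mid e_{\le t}, o_{t}, g\right)
\end{equation*}
```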

4 Experiments

In the experiments, we aim to answer the following questions: (1) How well does the trajectory-conditioned policy imitate diverse demonstration trajectories? (2) Does imitation of diverse past experience enable the agent to explore in more diverse directions and guide exploration toward a trajectory with a near-optimal total reward? (3) Can our proposed method avoid myopic behaviors and converge to near-optimal solutions?

4.1 Implementation Details

Our algorithm begins with an empty buffer, and we initialize the demonstration as a list of zero vectors. With such an input demonstration, the agent performs random exploration to collect trajectories that fill the buffer $\mathcal{D}$. In practice, a sampled demonstration trajectory can be lengthy, so we present one part of the demonstration at a time as input to the policy, similar to translating a paragraph sentence by sentence. More specifically, we first input the first part of the demonstration into the policy; once the index of the agent's last visited state embedding falls within this part, we consider it accomplished and switch to the next part. We repeat this process until the last part of the demonstration. If the last part is shorter than the fixed part length, we pad the sequence with zero vectors.
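A sketch of this windowing procedure is given below; it reflects one plausible reading of the switching condition (which was not fully recoverable), with a part length of 10 as in Table 2.

```python
def next_demo_part(demo, u, current_part, part_len=10):
    """Sketch of feeding the demonstration to the policy part by part.
    demo: list of demonstration embeddings (tuples); u: index of the agent's last
    visited embedding; current_part: index of the part currently shown to the policy.
    The switching condition here is an assumption, not the exact published rule."""
    start = current_part * part_len
    # Switch to the next part once the agent's last visited embedding lies inside
    # the part that is currently being shown (and a next part exists).
    if start < u <= start + part_len and start + part_len < len(demo):
        current_part += 1
        start = current_part * part_len
    part = list(demo[start:start + part_len])
    zero = tuple(0 for _ in demo[0])
    part += [zero] * (part_len - len(part))   # pad the final part with zero vectors
    return part, current_part
```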

A monotonically increasing reward transformation $f$ is applied to the environment reward, with different choices on the Mujoco domain and on the other environments, and a fixed positive imitation reward is used to encourage following the demonstration. Further details about the hyperparameters and the environment setup are described in the Appendix. Our PPO implementation is based on OpenAI Baselines (baselines).

We compare our method with the following baselines: (1) PPO: Proximal Policy Optimization (schulman2017proximal); (2) PPO+EXP: PPO with a count-based exploration bonus derived from the number of times the discrete state representation has been visited during training; (3) PPO+SIL: PPO with Self-Imitation Learning (oh2018self).

4.2 Key-Door-Treasure Domain

The Key-Door-Treasure domain (shown in Figure 1) is a simple grid-world environment with deceptive rewards that can lead the agent to a local optimum. An observation consists of the agent's location $(x, y)$ and binary variables indicating whether the agent has gotten the key, opened the door, or collected the treasure. A state is represented by the agent's location together with the cumulative episode reward, which identifies both where the agent is and which objects it has collected.

Figure 4: Learning curves on the Key-Door-Treasure domain averaged over 5 runs: (a) average episode reward; (b) best episode reward found during training; (c) number of discrete state embeddings found during training; (d) average imitation success ratio. The curves in dark colors are averages over the 5 curves in light colors. The x-axis and y-axis correspond to the number of steps and the performance statistic, respectively. The average reward and average imitation success ratio are mean values over the 40 most recent episodes.
Figure 5: Visualization of the trajectories stored in the buffer for PPO+SIL and our method as training continues. The agent (gray), apple (red), key (blue), door (green), and treasure (yellow) are shown as squares for simplicity.

As shown in Figure 4(a), both the PPO and PPO+SIL agents get stuck with the suboptimal policy of collecting the first two apples (Figure 1). The PPO+EXP agent explores further and gathers the two apples and one key. Our method learns to collect the objects on the right side of the maze and achieves the highest total reward of 8 within the time limit. Figure 4(b) and Figure 4(c) show the highest episode reward and the number of different state embeddings found during training. The PPO+EXP agent occasionally scores an episode reward of 5 because its initial location is not fixed, and from certain starting locations it can luckily collect three apples, pick up the key, and open the door within the time limit. However, even when the PPO+EXP agent explores the reward signals on the right side of the maze, it never finds a path to the treasure.

In Figure 4(d), we show the average success ratio of imitation during training. It is defined as follows: for a given demonstration $g$, let $u$ be the index of the last visited state embedding in $g$; then the success ratio of imitating $g$ is $u / |g|$ (i.e., the portion of the trajectory imitated). Ideally, we want the success ratio to be 1.0, which indicates that the trajectory-conditioned policy can successfully follow any given demonstration from the buffer. At 3M steps, when trajectories with the optimal total reward of 8 have been found, we start to sample the best trajectories stored in the buffer as demonstrations. Our trajectory-conditioned policy imitates them well, with a success ratio around 1.0.

Figure 5 visualizes the learning process. PPO+SIL fails on this task because the agent quickly exploits a good experience of collecting the apples, so its buffer is filled with trajectories exploring only the nearby region. In contrast, our method maintains a buffer of diverse trajectories, which are used as demonstrations to guide the agent to explore different regions and discover an optimal behavior.

4.3 Toy Montezuma’s Revenge Domain

Figure 6: Map of Toy Montezuma’s Revenge, where we show the agent (gray), key (blue), door (green), and treasure (yellow) as squares. The rewards for the key, door, and treasure are 100, 300, and 10,000, respectively. An optimal path with the highest total reward of 11,600 is shown as a red line.
Figure 7: Learning curves on Toy Montezuma’s Revenge averaged over 5 runs.

We evaluate our method on a more challenging domain, Toy Montezuma’s Revenge (roderick2018deep), which requires a more sophisticated exploration strategy. As shown in Figure 6, there are 24 rooms laid out similarly to the first level of Atari Montezuma’s Revenge, with a discrete grid for each room. The agent must navigate the labyrinth to locate the keys, unlock the doors, and reach the goal (the treasure room). The observation is represented by the agent’s location and the cumulative episode reward. The state representation is the same as the observation.

The learning curve of the averaged episode reward in Figure 7 shows that PPO, PPO+SIL, and PPO+EXP could not learn a policy to reach the goal. The PPO+EXP agent occasionally finds a trajectory with the total reward of 11,200 reaching the treasure room, but fails to exploit this experience. On the other hand, our method learns a good behavior of not only reaching the goal room, but also collecting all of the keys to achieve an optimal total reward of 11,600.

4.4 Atari Montezuma’s Revenge

We evaluate our method on the hard-exploration game Montezuma’s Revenge in the Arcade Learning Environment (ALE) (bellemare2013arcade; machado2017revisiting), which consists of multiple levels with 24 rooms in each level (map shown in Figure 11). The observation is a frame of raw pixels, and we use a state representation consisting of the agent’s ground-truth location (obtained from RAM) and the number of keys it holds. We use the number of keys to reduce the size of the state embedding space. With such a discrete state representation, we can also add a count-based exploration bonus to the reward in Equation 1 in order to expedite exploration. This variant is denoted ‘DTSIL+EXP’.

Figure 8: Learning curves of the average episode reward, the best episode reward, and the number of different rooms found on Atari Montezuma’s Revenge, averaged over 5 runs. For DTSIL+EXP, 29,278 is the maximum mean score (averaged over 40 recent episodes) achieved over 800M environment timesteps (3,200M frames) during training, averaged over 5 runs. DTSIL+EXP discovers around 40 rooms on average while PPO+EXP never finds a path to pass through all 24 rooms at the first level and then proceed to the next level.

As shown in Figure 8, in the early stage, the average episode reward of DTSIL+EXP is worse than that of PPO+EXP because our policy is trained to imitate diverse demonstrations instead of directly maximizing the environment reward. Compared to PPO+EXP, our agent is not eager to blindly follow high-reward paths, since a path with a relatively low score in the short term might lead to higher rewards in the long term. As training continues, our method successfully discovers trajectories that accomplish the first level with a satisfactory total reward of more than 20,000. After switching to imitating the best trajectories in the buffer by sampling them as demonstrations, the average episode reward dramatically increases to over 25,000 (demo videos of the learned policies for both PPO+EXP and DTSIL+EXP are available at https://sites.google.com/view/diverse-sil/home). Table 1 compares our proposed method with previous works that do not use any expert demonstration or reset to an arbitrary state; our approach significantly outperforms them.

Method        | Score
DTSIL         | 29,278
Abstract-HRL  | 11,500
A2C+CoEX+RAM  |  6,600
SmartHash     |  5,661
DeepCS        |  3,500
A2C+SIL       |  2,500
PPO+CoEX      | 11,618
RND           | 10,070
Table 1: Comparison with the state-of-the-art results on Montezuma’s Revenge without using expert demonstrations or resetting to an arbitrary state. DTSIL, Abstract-HRL (liu2019learning), A2C+CoEX+RAM (choi2018contingency), SmartHash (tang2017exploration), and DeepCS (stanton2018deep) make use of information from RAM, while A2C+SIL (oh2018self), PPO+CoEX (choi2018contingency), and RND (burda2018exploration) do not use RAM information. The score is averaged over multiple runs, gathered from each paper.

4.5 Mujoco

We evaluate our proposed method on continuous control tasks. We adapt the maze environment introduced in (duan2016benchmarking) to construct a set of challenging tasks that require the point-mass agent to collect a key, open the door of the same color, and finally reach the treasure to obtain a high score. A key cannot be used again once it has opened a door of the same color, which makes it easy for the agent to become trapped. A visualization of these environments is shown in Figure 9.

The observation is the agent’s location and range sensor reading about nearby objects. The state representation is . We discretize the continuous variables as integers when determining whether a state embedding has a corresponding trajectory stored in the buffer and whether a state embedding in the demonstration has been visited by the agent.

Figure 9: Point Maze in the Mujoco domain. The reward for getting the key, opening the door, and collecting the treasure (yellow block) is 1, 2, and 6, respectively. Once the agent uses a key to open a door of the matching color, the key cannot be used again. The learning curves of the episode reward and the number of found state representations are averaged over 3 independent runs.

In the first maze of Figure 9, the agent can easily get the blue key near its initial location and open the blue door in the upper part. However, the optimal path is to bring the key to the blue door at the bottom and obtain the treasure, reaching an episode reward of 9. In the second maze, the agent should carry the blue key and also pick up the green key while avoiding opening the blue door in the upper part. Then the green and blue keys can open the two doors at the bottom of the maze, which results in a total reward of 12. The learning curves in Figure 9 show that PPO, PPO+EXP, and PPO+SIL may get stuck in a sub-optimal behavior, whereas our policy eventually converges to the behavior achieving the optimal episode reward.

5 Conclusion

This paper proposes to learn diverse behaviors by imitating diverse trajectory-level demonstrations through count-based exploration over these trajectories. Imitation of diverse past trajectories can guide the agent to rarely visited states and encourages further exploration of novel states. We demonstrate that, on a variety of environments with local optima, our method significantly improves over self-imitation learning (SIL): it avoids prematurely converging to a sub-optimal solution and learns a near-optimal behavior that achieves a high total reward.

References

Appendix A Algorithm of Trajectory Buffer Update

  Input: the trajectory buffer D
  Input: the current episode E
  # Consider all the states in E
  for each step t do
     # Consider the state s_t (with embedding e_t) and the partial episode E_≤t
     if there exists an entry (e, τ, n) in D with e = e_t then
        # Compare the partial episode E_≤t with the stored trajectory τ
        if E_≤t has a higher total reward, or reaches the same total reward in fewer steps, than τ then
           τ ← E_≤t
        end if
        n ← n + 1
     else
        D ← D ∪ {(e_t, E_≤t, 1)}
     end if
  end for
  return D
Algorithm 2 Update Trajectory Buffer

Appendix B Algorithm of Sampling Demonstrations

  Input: the trajectory buffer D = {(e^(i), τ^(i), n^(i))}
  if trajectories with a satisfactory reward have been found then
     # Sample the trajectories reaching a near-optimal score in the buffer
     g^(j) ← state embedding sequence of one of the best trajectories in D, for all j
  else
     # Favor trajectories ending with rarely visited state embeddings
     Calculate a sampling probability distribution p over the entries of D from the visit counts n^(i)
     Sample entries (e, τ, n) ~ Categorical(D, p)
     g^(j) ← state embedding sequence of the sampled trajectory, for all j
  end if
  return the demonstrations {g^(j)}
Algorithm 3 Sample Demonstration Trajectories

Appendix C Hyperparameters

The hyper-parameters for our proposed method used in each experiment are listed in Table 2.

Hyperparameter                     | Key-Door-Treasure | Toy Montezuma's Revenge | Atari Montezuma's Revenge | Mujoco
Learning rate                      | 2.5e-4 | 2.5e-4 | 2.5e-4 | 1e-4
                                   | 2 | 4 | 8 | 8
Length of demonstration input part | 10 | 10 | 10 | 10
Weight of supervised learning loss | 10, decreasing to 1 when action prediction accuracy > 0.85 | 10, decreasing to 1 when action prediction accuracy > 0.85 | 10, decreasing to 1 when action prediction accuracy > 0.75 | 0
Satisfactory episode reward        | 8 | 11,600 | 20,000 | 9 or 12
Table 2: Hyper-parameters on various environments for our experiments.

Appendix D Environment Setting

On Atari Montezuma’s Revenge, the number of keys the agent holds at each step is approximated from the reward signal, because the agent gets one key with reward 100 and loses one key when opening a door with reward 300. More details of the environment setting are summarized in Table 3.
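A minimal sketch of this approximation is shown below; the function name is illustrative.

```python
def update_key_count(num_keys, reward):
    """Sketch of approximating the number of keys held on Atari Montezuma's Revenge,
    per the description above: a reward of 100 means a key was picked up, and a
    reward of 300 means a door was opened (consuming a key)."""
    if reward == 100:
        num_keys += 1
    elif reward == 300:
        num_keys = max(0, num_keys - 1)
    return num_keys
```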

Environment          | Key-Door-Treasure | Toy Montezuma's Revenge | Atari Montezuma's Revenge | Mujoco
Observation          | agent's location (x, y) in a 17x13 grid and binary variables indicating whether the key, door, or treasure has been collected | agent's location (room, x, y) in a 24x11x11 grid and the accumulated reward | stack of the 4 most recent gray-scale observations with shape 84x84x4 | agent's location (x, y) in a 22x22 space and range-sensor readings about nearby objects
State representation | agent's location and accumulated reward | same as the observation | agent's location (from RAM) and the number of keys held | agent's location discretized into a 9x9 grid
Action               | 5 discrete actions: up, down, left, right, noop | 5 discrete actions: up, down, left, right, noop | 18 discrete actions: noop, fire, left, ... | continuous action space
Reward               | apple +1, key +1, door +1, treasure +5 | key +100, door +300, treasure +10000 | key +100, door +300 | key +1, door +2, treasure +6
Time limit           | 35 steps | 1000 steps | 4500 steps | 1000 steps
Stochasticity        | take 3 random steps before the episode starts | take 5 random steps before the episode starts | MontezumaRevengeNoFrameskip-v4; take a random number (between 0 and 30) of noop actions before the episode starts | add random normal noise to the agent's initial position
Table 3: The setting on various environments for our experiments.

Appendix E Details of Network Architecture and Training Process

In the trajectory-conditioned policy, we first embed the input state embedding ($e^g_i$ or $e_t$) with a fully-connected layer with 64 units. Next, an RNN with gated recurrent units (GRU) computes the hidden feature ($h^g_i$ or $h_t$) with 128 units. The attention weight is calculated based on the Bahdanau attention mechanism [4]. The concatenation of the context vector $c_t$, the hidden feature $h_t$ of the agent’s current state, and convolutional features from the observation is used to predict the action $a_t$. For experiments on the Key-Door-Treasure domain, Toy Montezuma’s Revenge, and Mujoco, the features from the observation are not required for the policy. However, on the Atari game Montezuma’s Revenge, it is necessary to take the raw observation as input to the policy, because the location information in $e_t$ alone cannot capture the temporal context needed, e.g., for avoiding moving skulls and passing laser gates.

During the imitation phase in our implementation, we sample the top 1% of trajectories in the buffer (those with the highest total reward and the fewest steps) as demonstrations, and train our trajectory-conditioned policy to imitate them and achieve good performance.

Appendix F Comparison with Learning Diverse Policies by SVPG

While the code for the Stein variational policy gradient (SVPG) method of [18] has not been released, we replicate the method to learn diverse policies. Their experiments focus on continuous control tasks with relatively simple observation spaces and a limited number of locally optimal branches in the state space. We learn 8 diverse policies in parallel following their method on our Key-Door-Treasure domain with a discrete action space. Figure 10 visualizes the learning progress: the 8 policies learn to cover different regions of the environment. The method explores better than PPO+SIL, but the exploration of each individual agent is not strong enough to find the optimal path that achieves the highest episode reward.

Figure 10: Visualization of the trajectories stored in the buffer for PPO+SIL, SVPG diverse [18] and our method as training continues. In the second row, we show the trajectories for a total of 8 policies learned simultaneously with the SVPG method proposed in [18], where each color corresponds to the trajectories collected by each policy.

Appendix G Map of Atari Montezuma’s Revenge at the First Level

A map of Atari Montezuma’s Revenge at the first level is shown in Figure 11. It is challenging to bring two keys to open the two doors in room 17 behind the treasure in room 15, where the agent can pass to the next level.

Figure 11: Map of Atari Montezuma’s Revenge at the first level with 24 rooms.

Appendix H Comparison with Go-Explore

To evaluate the efficiency of exploration, we compare our method with the “exploration phase” of the Go-Explore algorithm [16]. The idea behind Go-Explore is to reset the agent to an interesting state sampled from the buffer of state embeddings and then explore further using random actions. For a fair comparison with our method, we modify the Go-Explore code so that it cannot reset to arbitrary states in the environment. We also use the same state representation and the same sampling weights to sample goal states from the buffer.

In this Go-Explore variant, without the direct ‘reset’ function but with a perfect goal-conditioned policy that can visit any state sampled from the buffer, the agent precisely advances to the goal state by following the stored trajectory. The total number of steps taken in the environment is counted as the sum of the steps taken to follow the stored trajectories and the steps taken to explore.

Figure 12: Learning curves of the number of rooms and the number of different state representations found on Atari Montezuma’s Revenge, averaged over 5 runs. The curves in dark colors are the average of the 5 curves in light colors. During training, the state representation consists of the agent’s location and the number of keys it holds.

In Figure 12, we show the average number of rooms and the number of different state representations found during training. Even if we assume that Go-Explore has a perfect goal-conditioned policy that guides the agent to follow the stored trajectory exactly and visit the goal state, the learning curves demonstrate that our method explores diverse state representations more efficiently and consequently visits more rooms. This is because our method uses the count-based exploration bonus to encourage exploration around and beyond the stored trajectories, and the imitation reward allows the agent to follow the demonstrations in a soft order.

Appendix I Automatic Switching between Exploration and Imitation

As stated in Section 3, when the agent has discovered trajectories with a satisfactory total reward, imitation of the best trajectories is switched on. However, instead of setting the threshold for a satisfactory total reward by hand, we can switch between exploration and imitation automatically. In the early stage of training, the agent prefers the exploration mode because there are few good trajectories to imitate. As training progresses, it becomes reasonable to train the policy to follow the best trajectories stored in the buffer. Assuming the agent has taken $t$ steps in the environment and the total number of training steps is $T$ (e.g., 800M steps for the experiments on Montezuma’s Revenge), the probability of using the imitation mode is set as an increasing function of the training progress $t/T$. Therefore, the probability of imitating good trajectories increases as training continues, and the agent switches between exploration and imitation automatically without the need to set a threshold on the trajectory reward.
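A minimal sketch of this schedule is shown below; the linear ramp is one plausible choice consistent with the description above, not necessarily the exact schedule used.

```python
def imitation_probability(steps_taken, total_steps=800_000_000):
    """Sketch of the automatic exploration/imitation switch: the probability of
    entering imitation mode grows with training progress. The linear ramp is an
    assumption; only the increasing trend is taken from the description above."""
    return min(1.0, steps_taken / total_steps)

# At the start of each episode, imitate the best buffered trajectories with this
# probability; otherwise sample a demonstration with the count-based weighting.
```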

Figure 13: Learning curves of the average episode reward, the best episode reward, and the number of different rooms found on Atari Montezuma’s Revenge, averaged over 3 runs. The probability of taking the imitation mode is automatically adjusted during training. 30,877 is the maximum mean score (averaged over 40 recent episodes) achieved over 800M environment timesteps (3,200M frames) during training, averaged over 3 runs.

As shown in Figure 13, by automatically switching between exploration and imitation our method can reach the average episode reward of around 30,000 over 3 runs within 800M environment steps.