Episodic Self-Imitation Learning with Hindsight

11/26/2020 ∙ by Tianhong Dai, et al. ∙ Imperial College London

Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state-action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode of the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transitions-based method which performs poorly in handling continuous control environments with sparse rewards. From the experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving comparable performance to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent learning undesirable hindsight experiences. With the capability of solving sparse reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems that have continuous action spaces, such as robot guidance and manipulation.




1 Introduction

Reinforcement learning (RL) has been shown to be very effective in training agents within gaming environments Mnih et al. (2015); Silver et al. (2016), particularly when combined with deep neural networks LeCun et al. (2015); Silver et al. (2016); Liu et al. (2017). In most task settings that are solved by RL algorithms, reward shaping is an essential requirement for guiding the learning of the agent. Reward shaping, however, often requires significant quantities of domain knowledge that are highly task-specific Ng et al. (1999) and, even with careful design, can lead to undesired policies. Moreover, for complex robotic manipulation tasks, manually designing reward shaping functions to guide the learning agent becomes intractable Arulkumaran et al. (2017); Florensa et al. (2017) if even minor variations to the task are introduced. For such settings, the application of deep reinforcement learning requires algorithms that can learn from unshaped, and usually sparse, reward signals. The complicated dynamics of robot manipulation exacerbate the difficulty posed by sparse rewards, especially for on-policy RL algorithms. For example, achieving goals that require successfully executing multiple steps over a long horizon involves high-dimensional control that must also generalise to work across variations in the environment for each step. These aspects of robot control result in a situation where a naive RL agent so rarely receives a reward at the start of training that it is not able to learn at all. A common solution in the robotics community is to collect a sufficient quantity of expert demonstrations, then use imitation learning to train the agent. However, in some scenarios, demonstrations are expensive to collect and the achievable performance of a trained agent is restricted by their quantity. One solution is to use the valuable past experiences of the agent to enhance training, and this is particularly useful in sparse reward environments.

To alleviate the problems associated with having sparse rewards, there are two kinds of approaches: imitation learning and hindsight experience replay (HER). First, the standard approach of imitation learning is to use supervised learning algorithms and minimise a surrogate loss with respect to an oracle. The most common form is learning from demonstrations Hester et al. (2018); Gao et al. (2018). Similar techniques are applied to robot manipulation tasks Rajeswaran et al. (2017); Večerík et al. (2017); Nair et al. (2018); James et al. (2018). When demonstrations are not attainable, self-imitation learning (SIL) Oh et al. (2018), which uses past good experiences (episodes in which the goal is achieved), can be used to enhance exploration or speed up the training of the agent. Self-imitation learning works well in discrete control environments, such as Atari games. Whilst SIL is able to learn policies for continuous control tasks with dense or delayed rewards Oh et al. (2018), the present experiments suggest that it struggles when rewards are sparse. Recently, hindsight experience replay has been proposed to solve such goal-conditional, sparse reward problems. The main idea of HER Andrychowicz et al. (2017) is that during replay, the selected transitions are sampled from state–action pairs derived from achieved goals that are substituted for the real goals of the task; this increases the frequency of positive rewards. Hindsight experience replay is used with off-policy RL algorithms, such as DQN Mnih et al. (2015) and DDPG Lillicrap et al. (2015), for experience replay and has several extensions Schaul et al. (2015); Liu et al. (2019a). The present experiments show that simply applying HER with SIL does not lead to an agent capable of performing tasks from the Fetch robot environment. In summary, self-imitation learning with on-policy algorithms for tasks that require continuous control, and for which rewards are sparse, remains unsolved.

In this paper, episodic self-imitation learning (ESIL) for goal-oriented problems that provide only sparse rewards is proposed and combined with a state-of-the-art on-policy RL algorithm: proximal policy optimization (PPO). In contrast to standard SIL, which samples past good transitions from the replay buffer for imitation learning, the proposed ESIL adopts entire current episodes (successful or not), and modifies them into “expert” trajectories based on HER. An extra trajectory selection module is also introduced to relieve the effects of sample correlation Lee et al. (2019) in updating the network. Figure 1 shows the difference between naive SIL+HER and ESIL. During training by SIL+HER, a batch of transitions is sampled from the replay buffer; these are modified into “hindsight experiences” and used directly in self-imitation learning. In contrast, ESIL utilises the entire set of currently collected episodes and converts them into hindsight episodes. The trajectory selection module removes undesired transitions in the hindsight episodes. Using tasks from the OpenAI Fetch environment, this paper demonstrates that the proposed ESIL approach is effective in training agents which are required to solve continuous control problems, and shows that it achieves state-of-the-art results on several tasks.

Figure 1: Illustration of difference between self-imitation learning (SIL)+hindsight experience replay (HER) and episodic self-imitation learning (ESIL).

The primary contribution of this paper is a novel episodic self-imitation learning (ESIL) algorithm that can solve continuous control problems in environments providing only sparse rewards; in doing so, it also empirically answers an open question posed by Plappert et al. (2018). The proposed ESIL approach also provides a more efficient way to perform exploration in goal-conditional settings than the standard self-imitation learning algorithm. Finally, this approach achieves, to our knowledge, the best results for four moderately complex robot control tasks in simulation. The paper is organised as follows: Sections 2 and 3 provide an introduction to related work and corresponding background. Section 4 describes the methodology of the proposed ESIL approach. Section 5 introduces the settings and results of the experiments. Finally, Section 6 provides concluding remarks and suggestions for future research.

2 Related Work

Imitation learning (IL) can be divided into two main categories: behavioural cloning and inverse reinforcement learning Hussein et al. (2017). Behavioural cloning involves the learning of behaviours from demonstrations Bojarski et al. (2016); Xu et al. (2017); Torabi et al. (2018). Other extensions have an expert in the loop, such as DAgger Ross et al. (2011), or use an adversarial paradigm for the behavioural cloning method Ho and Ermon (2016); Wang et al. (2017). Inverse reinforcement learning estimates a reward model from expert trajectories Ng and Russell (2000); Abbeel and Ng (2004); Ziebart et al. (2008). Learning from demonstrations is powerful for complex robotic manipulation tasks Finn et al. (2016); Zhang et al. (2018); Finn et al. (2017); Rajeswaran et al. (2017); Fang et al. (2019a). Ho and Ermon (2016) propose generative adversarial imitation learning (GAIL), which employs generative adversarial training to match the distribution of state–action pairs of demonstrations. Compared with behavioural cloning, the GAIL framework shows strong improvements in continuous control tasks. In the work of Ding et al. (2019), goalGAIL is proposed to speed up the training in goal-conditional environments; goalGAIL was also shown to be able to learn from demonstrations without action information. Prior work has used demonstrations to accelerate learning Rajeswaran et al. (2017); Večerík et al. (2017); Nair et al. (2018). Demonstrations are often collected from an expert policy or from human actions. In contrast to these approaches, episodic self-imitation learning (ESIL) does not need demonstrations.

Self-imitation learning (SIL) Oh et al. (2018) is used for exploiting past experiences for parametric policies. It has a similar flavour to Gangwani et al. (2018); Wu et al. (2019), in that the agent learns from imperfect demonstrations. During training, past good experiences are stored in the replay buffer. When SIL starts, transitions are sampled from the replay buffer according to the advantage values. In the work of Tang (2020), generalised SIL was proposed as an extension of SIL. It uses a lower-bound $Q$-learning approach to generalise the original SIL technique, and shows robustness to a wide range of continuous control tasks. Generalised SIL can also be combined with both deterministic and stochastic RL algorithms. Guo et al. (2019) point out that using imitation learning with past good experience could lead to a sub-optimal policy. Instead of imitating past good trajectories, a trajectory-conditioned policy Guo et al. (2019) is proposed to imitate trajectories in diverse directions, encouraging exploration in environments where exploration is otherwise difficult. Unlike SIL, episodic self-imitation learning (ESIL) applies HER to the current episodes to create “imperfect” demonstrations for imitation learning; this also requires introducing a trajectory-selection module to reject undesired samples from the hindsight experiences. In the work of Lee et al. (2019), it was shown that the agent benefits from using whole episodes in updates, rather than uniformly sampling transitions, in sparse or delayed reward environments. The present experiments suggest that episodic self-imitation learning achieves better performance in an agent that must learn to perform continuous control in environments delivering sparse rewards.

Recently, the technique known as hindsight learning was developed. Hindsight experience replay (HER) Andrychowicz et al. (2017) is an algorithm that can overcome the exploration problems in multi-goal environments that deliver sparse rewards. Hindsight policy gradient (HPG) Rauber et al. (2019) introduces techniques that enable the learning of goal-conditional policies using hindsight experiences. However, the current implementation of HPG has only been evaluated for agents that need to perform discrete actions, and one drawback of hindsight policy gradient estimators is their computational cost, which stems from the goal-oriented sampling. An extension of HER, called dynamic hindsight experience replay (DHER) Fang et al. (2019b), was proposed to deal with dynamic goals. Liu et al. (2019b) use the GAIL framework Ho and Ermon (2016) to generate trajectories that are similar to hindsight experiences, and then apply imitation learning using these trajectories. Competitive Experience Replay (CER) complements HER by introducing a competition between two agents for exploration Liu et al. (2019a). Zhao and Tresp (2018) point out that the hindsight trajectories which contain higher energy are more valuable during training, leading to a more efficient learning system. Fang et al. (2019c) proposed curriculum-guided HER, which incorporates curriculum learning into HER. During training, the agent focuses on the closest goals in the initial stage, then focuses on expanding the diversity of goals. This approach accelerates training compared with other baseline methods. Unlike these works, episodic self-imitation learning (ESIL) combines episodic hindsight experiences with imitation learning, which aids learning at the start of training. Furthermore, ESIL can be applied to continuous control, making it more suitable for control problems that demand greater precision.

3 Background

3.1 Reinforcement Learning

Reinforcement Learning (RL) can be formulated under the framework of a Markov Decision Process (MDP); it is used to learn an optimal policy to solve sequential decision-making problems. In each time step $t$, the state $s_t$ is received by the agent from the environment. An action $a_t$ is sampled by the agent according to its policy $\pi_\theta(a_t \mid s_t)$, parameterised by $\theta$, which (in deep reinforcement learning) represents the weights of an artificial neural network. Then, the state $s_{t+1}$ and reward $r_t$ are provided by the environment to the agent. The goal is to have the agent learn a policy that maximises the expected return Sutton and Barto (2018)

$$\mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_t\right], \qquad (1)$$

where $\gamma$ is the discount factor. In a robot control setting, the state can be the velocity and position of each joint of the robotic arm. The action can be the velocities of actuators (control signals) and the reward might be calculated based on the distance between the gripper of the robot arm and the target position.
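
As a minimal sketch of the return computation, the per-step discounted return $R_t = \sum_{i \ge t} \gamma^{i-t} r_i$ can be accumulated backwards over a finite episode. The function name `discounted_returns` is hypothetical, and the default `gamma=0.98` is simply the discount factor reported later in the hyperparameter section.

```python
def discounted_returns(rewards, gamma=0.98):
    """Return [R_0, ..., R_{T-1}] with R_t = sum_{i>=t} gamma**(i-t) * r_i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy episode with a single reward at the final step.
example = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
```

The backward recursion costs one multiplication per step, which is why it is the usual way to evaluate this sum.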

3.2 Proximal Policy Optimization

In this work, proximal policy optimization (PPO) Schulman et al. (2017) is selected as the base RL algorithm. This is a state-of-the-art, on-policy actor-critic approach to training. The actor-critic architecture is common in deep RL; it is composed of an actor network which is used to output a policy, and a critic network which outputs a value to evaluate the current state, $V(s_t)$. PPO has been widely tested in robot control Andrychowicz et al. (2020) and video games Berner et al. (2019). In contrast with “vanilla” policy gradient algorithms, PPO learns the policy using a surrogate objective function

$$L^{clip}(\theta) = \mathbb{E}_t\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t,\ \mathrm{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\ 1-\epsilon,\ 1+\epsilon\right) A_t\right)\right], \qquad (2)$$

where $\pi_\theta$ is the current policy and $\pi_{\theta_{old}}$ is the old policy; $\epsilon$ is a clipping ratio which limits the change between the updated and the previous policy during the training process. $A_t$ is the advantage value which can be estimated as $R_t - V(s_t)$, with $R_t$ being the return value, and $V(s_t)$ the state value predicted by the critic network.
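
The clipped surrogate can be illustrated with a small, framework-free sketch. It assumes that log-probabilities under the current and old policies and the advantage estimates are already available as plain lists; the function name `clipped_surrogate` is hypothetical, and no gradient computation is shown.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_ratio=0.2):
    """Average PPO clipped surrogate objective over a batch of samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
        # Taking the minimum keeps the pessimistic (lower) estimate.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Probability ratio of 2 with a positive and a negative advantage.
gain = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
loss = clipped_surrogate([math.log(2.0)], [0.0], [-1.0])
```

With a positive advantage the ratio is capped at $1+\epsilon$, whereas with a negative advantage the unclipped term is the smaller one and is kept, which is the asymmetry that discourages large policy changes.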

3.3 Hindsight Experiences and Goals

The experiments follow the terminology suggested by OpenAI Plappert et al. (2018), in which the possible goals are drawn from a goal space $\mathcal{G}$, and the goal being pursued does not influence the environment dynamics. In ESIL, two types of goal are recognised. One is the desired goal $dg$, which is the target position or state, and may be different for different episodes. Within a single episode, $dg$ is constant. The second type of goal is the achieved goal $ag_t$, which is the achieved state in the environment, and this is considered to be different at each time step in an episode. In an episode, each transition can be represented as $(s_t, a_t, r_t, g_t)$, where $s_t$ indicates a state, $a_t$ indicates an action and $r_t$ indicates a reward; $g_t = (dg, ag_t)$ is simply used to represent the grouping of goals.

In sparse reward settings, an agent will only get positive rewards when the desired goal is achieved. The sparse reward function can be defined as

$$r_t = \begin{cases} r^{+}, & \text{if } \lVert ag_t - dg \rVert \le \epsilon_g \\ r^{-}, & \text{otherwise,} \end{cases} \qquad (3)$$

where $ag_t$ is the achieved goal, $dg$ is the desired goal, $\epsilon_g$ is a threshold value, used to identify if the agent has achieved the goal, and $r^{+} > r^{-}$ are the environment-specific reward values for reaching and missing the goal. However, the desired goal, $dg$, might be difficult to reach during training. Thus, hindsight experiences are created by replacing the original desired goal with the current achieved goal to augment the successful samples, and the reward can then be recomputed according to Equation 3. The modification of the desired goal can be denoted as $dg' = ag$, and transitions from hindsight experiences can be represented as $(s_t, a_t, r'_t, g'_t)$, where $r'_t$ is the recomputed reward and $g'_t = (dg', ag_t)$ groups the modified goals. Intuitively, introducing $dg'$ serves a useful purpose in the early stages of training; taking, for example, a robot reaching task, the agent has no prior concept of how to move its effector to a specific location in space. Thus, even these original failed episodes contain valuable information for ultimately learning a useful control policy for the original, desired goal $dg$.
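
The effect of replacing the desired goal can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `sparse_reward`, the Euclidean distance, and the reward values of 0 on success and -1 otherwise (the convention of the Gym Fetch tasks) are all choices made for the example.

```python
import math


def sparse_reward(achieved_goal, desired_goal, threshold, success=0.0, failure=-1.0):
    """Threshold-based sparse reward; the success/failure values are assumptions."""
    reached = math.dist(achieved_goal, desired_goal) <= threshold
    return success if reached else failure


desired = (1.0, 0.0)   # original desired goal, far from the agent
achieved = (0.2, 0.0)  # goal actually achieved at this step

original = sparse_reward(achieved, desired, threshold=0.05)
# Hindsight: pretend the achieved goal was the desired goal all along.
relabelled = sparse_reward(achieved, achieved, threshold=0.05)
```

The same transition therefore counts as a failure under the original goal and as a success under the substituted goal, which is what makes failed episodes informative.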

4 Methodology

The proposed method combines PPO and episodic self-imitation learning to maximally use hindsight experiences for exploration to improve learning. Recent advances in episodic backward update Lee et al. (2019) and hindsight experiences Andrychowicz et al. (2017) are also leveraged to guide exploration for on-policy RL.

4.1 Episodic Self-Imitation Learning

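The present method aims to use episodic hindsight experiences to guide the exploration of the PPO algorithm. To this end, hindsight experiences are created from current episodes. For an episode $i$, let there be $T$ time steps; after $T$ steps, a series of transitions $\tau_i = \{(s_t, a_t, r_t, g_t)\}_{t=0}^{T-1}$ is collected, where $g_t = (dg, ag_t)$ groups the desired goal $dg$ and the achieved goal $ag_t$. If, at time step $T$ in $\tau_i$, $\lVert ag_T - dg \rVert > \epsilon_g$ (with $\epsilon_g$ the goal threshold), it implies that in this episode, the agent failed to achieve the original goal. To create hindsight experiences, the achieved goal $ag_T$ in the last state is simply selected and considered as the modified desired goal $dg'$, i.e., $dg' = ag_T$. Next, a new reward $r'_t$ is computed under the new goal $dg'$. Then, a new “imagined” episode $i'$ is obtained, and a new series of transitions $\tau'_i = \{(s_t, a_t, r'_t, g'_t)\}_{t=0}^{T-1}$, with $g'_t = (dg', ag_t)$, is collected.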
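
The construction of a hindsight episode described above can be sketched as below. The dictionary layout of a transition, the function name `make_hindsight_episode`, the threshold of 0.05 and the 0/-1 reward values are assumptions made for illustration; each `achieved_goal` entry is taken to be the goal achieved after the action of that step.

```python
import math


def make_hindsight_episode(episode, threshold=0.05):
    """Clone an episode, using the final achieved goal as the new desired goal."""
    new_goal = episode[-1]["achieved_goal"]
    hindsight = []
    for transition in episode:
        clone = dict(transition)  # leave the original episode untouched
        clone["desired_goal"] = new_goal
        reached = math.dist(transition["achieved_goal"], new_goal) <= threshold
        clone["reward"] = 0.0 if reached else -1.0  # assumed sparse reward values
        hindsight.append(clone)
    return hindsight


# A failed one-dimensional episode: the agent never gets near the goal at 5.0.
episode = [
    {"state": 0, "action": 1, "achieved_goal": (0.0,), "desired_goal": (5.0,), "reward": -1.0},
    {"state": 1, "action": 1, "achieved_goal": (0.5,), "desired_goal": (5.0,), "reward": -1.0},
    {"state": 2, "action": 1, "achieved_goal": (1.0,), "desired_goal": (5.0,), "reward": -1.0},
]
hindsight_episode = make_hindsight_episode(episode)
```

Only the goal and the reward change; states and actions are reused, so the hindsight episode is an "imagined" success built from the same behaviour.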

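Then, an approach to self-imitation learning based on episodic hindsight experiences is proposed, which applies the policy updates to both hindsight and in-environment episodes. Proximal policy optimization (PPO), a state-of-the-art on-policy RL algorithm, is used as the base RL algorithm. With current and corresponding hindsight experiences, a new objective function is introduced and defined as

$$L = \alpha L_{ppo} + \beta L_{esil}, \qquad (4)$$

where $\alpha$ is the weight coefficient of $L_{ppo}$. In the experiments, $\alpha$ is held at a fixed default value to balance the contribution of $L_{ppo}$ and $L_{esil}$. $L_{ppo}$ is the loss of PPO which can be written as

$$L_{ppo} = L_{policy}(\theta_a) - c\, L_{value}(\theta_c), \qquad (5)$$

where $L_{policy}$ is the policy loss which is parameterised by $\theta_a$, $L_{value}$ is the value loss which is parameterised by $\theta_c$, and $c$ is the weight coefficient of $L_{value}$, which is set to 1 to match the default PPO setting Schulman et al. (2017). The policy loss, $L_{policy}$, can be represented as

$$L_{policy} = \mathbb{E}_{(s_t, a_t) \in \mathcal{T}}\left[\min\left(\frac{\pi_{\theta_a}(a_t \mid s_t)}{\pi_{\theta_{a,old}}(a_t \mid s_t)} A_t,\ \mathrm{clip}\left(\frac{\pi_{\theta_a}(a_t \mid s_t)}{\pi_{\theta_{a,old}}(a_t \mid s_t)},\ 1-\epsilon,\ 1+\epsilon\right) A_t\right)\right], \qquad (6)$$

here, $A_t$ is the advantage value, and can be computed as $R_t - V(s_t)$. $V(s_t)$ is the state value at time step $t$ which is predicted by the critic network. $R_t$ is the return at time step $t$. $\epsilon$ is the clip ratio. $\mathcal{T}$ indicates original trajectories. The value loss is a squared error loss, $L_{value} = (V(s_t) - R_t)^2$.

For the $L_{esil}$ term, $\beta$ is an adaptive weight coefficient of $L_{esil}$; it can be defined as the ratio of samples which are selected for self-imitation learning

$$\beta = \frac{N_{esil}}{N_{total}}, \qquad (7)$$

where $N_{esil}$ is the number of samples used for self-imitation learning and $N_{total}$ is the total number of collected samples. The episodic self-imitation learning loss can be written as

$$L_{esil} = \mathbb{E}_{(s_t, a_t, dg') \in \mathcal{T}'}\left[\log \pi_{\theta_a}(a_t \mid s_t, dg') \cdot F(s_t, dg')\right], \qquad (8)$$

where $\mathcal{T}'$ indicates hindsight trajectories and $F$ is the trajectory selection module which is based on returns of the current episodes, $R$, and the returns of corresponding hindsight experiences, $R'$.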
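
A rough sketch of the adaptive weight and the masked self-imitation term follows. It assumes that the self-imitation term is the log-likelihood of the hindsight actions, multiplied by a 0/1 selection flag and averaged over all collected samples; that averaging choice and the names `adaptive_weight` and `esil_objective` are assumptions for illustration, not details confirmed by the text.

```python
def adaptive_weight(selected):
    """Fraction of collected samples that were selected for self-imitation."""
    return sum(selected) / len(selected) if selected else 0.0


def esil_objective(log_probs, selected):
    """Mean log-likelihood of hindsight actions, counting only selected samples."""
    if not log_probs:
        return 0.0
    masked = [lp * flag for lp, flag in zip(log_probs, selected)]
    return sum(masked) / len(log_probs)


# Four hindsight samples, two of which pass the selection module.
log_probs = [-1.0, -2.0, -3.0, -4.0]
selected = [1, 0, 1, 0]
beta = adaptive_weight(selected)
objective = esil_objective(log_probs, selected)
```

Because the weight is the selected fraction, the self-imitation term fades automatically once few hindsight samples pass the selection, leaving the PPO term to drive the update.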

4.2 Episodic Update with Hindsight

Two important issues of ESIL are: (1) hindsight experiences are sub-optimal, and (2) updating networks with correlated trajectories has a detrimental effect. Although episodic self-imitation learning makes exploration more effective, hindsight experiences do not come from experts and are not “perfect” demonstrations. As training continues, if the agent always learns from these imperfect demonstrations, the policy will become stuck at a sub-optimal solution or will overfit.

To prevent the agent learning from imperfect hindsight experiences, hindsight experiences are actively selected based on returns. With the same action, different goals may lead to different results. The proposed method only selects hindsight experiences that can achieve higher returns. The trajectory selection module is illustrated in Figure 2. For an episodic experience and its hindsight experience, the returns can be calculated separately. In a trajectory, at time step $t$, the return $R_t$ can be calculated by $R_t = \sum_{i=t}^{T-1} \gamma^{i-t} r_i$. Then, for a trajectory $\tau$, we have $R = \{R_0, R_1, \dots, R_{T-1}\}$. For the hindsight experiences, similarly, the return for each time step, $R'_t$, with respect to the hindsight goal $dg'$, can be calculated. Based on the modified trajectory $\tau'$, which has the same length as $\tau$, we therefore have the returns $R' = \{R'_0, R'_1, \dots, R'_{T-1}\}$. During training, the hindsight experiences with higher returns are used for self-imitation learning. The remaining hindsight experiences are treated as worthless samples and ignored. Then, Equation (8) can be rewritten as

$$L_{esil} = \mathbb{E}_{(s_t, a_t, dg') \in \mathcal{T}'}\left[\log \pi_{\theta_a}(a_t \mid s_t, dg') \cdot F_t\right], \qquad (9)$$

where $\mathcal{T}'$ denotes the hindsight trajectories and $F_t$ is the trajectory selection module. The selection function can be expressed as

$$F_t = u(R'_t - R_t), \qquad (10)$$

here, $u(\cdot)$ is the unit step function, which is 1 when the hindsight return is higher and 0 otherwise. Consider the OpenAI FetchReach environment as an example. For a failed trajectory, the rewards are $r = \{-1, -1, \dots, -1\}$. The desired goal is modified to construct a new hindsight trajectory and the new rewards become $r' = \{-1, -1, \dots, 0\}$. Then, $R$ and $R'$ can be calculated separately.
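
The FetchReach example can be worked through numerically with a short sketch. It assumes rewards of -1 per failed step and 0 on success, a discount factor of 0.98, and a strict comparison of returns; the helper names `discounted_returns` and `select_hindsight` are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Per-step discounted returns, accumulated from the end of the episode."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def select_hindsight(rewards, hindsight_rewards, gamma=0.98):
    """Flag each step whose hindsight return is strictly higher than the original."""
    original = discounted_returns(rewards, gamma)
    hindsight = discounted_returns(hindsight_rewards, gamma)
    return [1 if r_h > r_o else 0 for r_o, r_h in zip(original, hindsight)]


# Failed episode: the hindsight goal turns the final step into a success.
failed_mask = select_hindsight([-1.0, -1.0, -1.0], [-1.0, -1.0, 0.0])
# Successful episode: relabelling changes nothing, so nothing is selected.
success_mask = select_hindsight([-1.0, -1.0, 0.0], [-1.0, -1.0, 0.0])
```

In the failed case every step is selected, because the single changed reward at the end raises the discounted return of all earlier steps; in the successful case the hindsight copy is redundant and is filtered out.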

Figure 2: A simplified illustration of trajectory selection. Blue trajectories indicate original experiences. Orange trajectories indicate hindsight experiences. Solid trajectories in the hindsight experiences are selected by the trajectory selection module of ESIL with new “imagined” goals.

From a goal perspective, episodic self-imitation learning (ESIL) tries to explore (desired) goals to get positive returns. It can be viewed as a form of multi-task learning, because ESIL has two objective functions to be optimised jointly. It is also related to self-imitation learning (SIL) Oh et al. (2018). However, the difference is that SIL uses past experiences to learn to choose the action chosen in the past in a given state, rather than exploring goals. The full description of ESIL can be found in Algorithm 1.

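Require: an actor network $\pi(s, dg \mid \theta_a)$, a critic network $V(s, dg \mid \theta_c)$, the maximum steps $T$ of an episode, a reward function $r$
for iteration $= 1, 2, \dots$ do
   $\mathcal{T} \leftarrow \emptyset$, $\mathcal{T}' \leftarrow \emptyset$
   for episode $= 1, 2, \dots, N$ do
      $\tau \leftarrow \emptyset$, $\tau' \leftarrow \emptyset$
      for $t = 0, 1, \dots, T-1$ do
         Sample an action $a_t$ using the actor network $\pi(s_t, dg \mid \theta_a)$
         Execute the action $a_t$ and observe a new state $s_{t+1}$
         Store the transition $(s_t, a_t, r_t, g_t)$ in $\tau$
      end for
      for each transition $(s_t, a_t, r_t, g_t)$ in $\tau$ do
         Clone the transition and replace $dg$ with $dg'$, where $dg' = ag_T$
         Recompute the reward $r'_t$ under the new goal $dg'$
         Store the transition $(s_t, a_t, r'_t, g'_t)$ in $\tau'$
      end for
      Store the trajectory $\tau$ and the hindsight trajectory $\tau'$ in $\mathcal{T}$ and $\mathcal{T}'$, respectively
   end for
   Calculate the returns $R$ and $R'$ for all transitions in $\mathcal{T}$ and $\mathcal{T}'$, respectively
   Calculate the PPO loss $L_{ppo}$ using $\mathcal{T}$ and (5)
   Calculate the ESIL loss $L_{esil}$ using $\mathcal{T}'$, $F$ and (8)
   Update the parameters $\theta_a$ and $\theta_c$ using loss (4)
end for
Algorithm 1 Proximal policy optimization (PPO) with Episodic Self-Imitation Learning (ESIL)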

5 Experiments and Results

The proposed method is evaluated on several multi-goal environments, including the Empty Room environment and the OpenAI Fetch environments (see Figure 3). The Empty Room environment is a toy example, and has discrete action spaces. In the Fetch environments, there are four robot tasks with continuous action spaces. To obtain a comprehensive comparison between the proposed method and other baseline approaches, suitable baseline approaches are selected for different environments. Ablation studies of the trajectory selection module are also performed.

(a) Empty Room (b) FetchReach (c) FetchPush (d) FetchPickPlace (e) FetchSlide
Figure 3: Evaluation environments. (a) is the Empty Room environment, in which a yellow circle indicates the position of the agent and a red star represents a target position. (b–e) are the Fetch robotic environments. The red spot represents a target position.

5.1 Setup

Empty Room (grid-world) environment: The Empty Room environment is a simple grid-world environment. The agent is placed in a grid representing the room. The goal of the agent is to reach a target position in the room. The start position of the agent is at the upper-left corner of the room, and the target position is randomly selected within the room. When the agent chooses an action that would lead it to fall outside the grid area, the agent stays at the current position. The length of each episode is 32. The desired goal, $dg$, is a two-dimensional grid coordinate which represents the target position. The achieved goal, $ag_t$, is also a two-dimensional coordinate which represents the current position of the agent at time step $t$, and finally, the observation is a two-dimensional coordinate which represents the current position of the agent. The agent has five actions: left, right, up, down and stay; the agent executes a random action with probability 0.2. The reward is sparse: the agent is rewarded only when $ag_t = dg$.
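
A toy version of such a room can be sketched as follows. The grid size of 11, the reward values of 1 on the target and 0 elsewhere, and the class and attribute names are assumptions made for the example; the start corner, the five actions, the random-action probability of 0.2 and the episode length of 32 follow the description above.

```python
import random

# (dx, dy) offsets; y grows downwards so (0, 0) is the upper-left corner.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # left, right, up, down, stay


class EmptyRoom:
    def __init__(self, size=11, horizon=32, random_action_prob=0.2, seed=None):
        self.size = size  # assumed grid size, not taken from the text
        self.horizon = horizon
        self.random_action_prob = random_action_prob
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.position = (0, 0)
        self.desired_goal = (self.rng.randrange(self.size),
                             self.rng.randrange(self.size))
        self.steps = 0
        return self.position, self.desired_goal

    def step(self, action):
        # With some probability the chosen action is replaced by a random one.
        if self.rng.random() < self.random_action_prob:
            action = self.rng.randrange(len(ACTIONS))
        dx, dy = ACTIONS[action]
        x, y = self.position[0] + dx, self.position[1] + dy
        if 0 <= x < self.size and 0 <= y < self.size:
            self.position = (x, y)  # moves that leave the grid are ignored
        self.steps += 1
        reward = 1.0 if self.position == self.desired_goal else 0.0  # assumed values
        done = self.steps >= self.horizon
        return self.position, reward, done
```

Here the achieved goal is simply `position`, so relabelling an episode amounts to treating the final `position` as if it had been `desired_goal`.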

The agent is trained with 1 CPU core. In each epoch, 100 episodes are collected for the training. After each epoch, the agent is evaluated for 10 episodes. During training, the actions are sampled from the categorical distribution. During evaluation, the action with the highest probability will be chosen.

Fetch robotic (continuous) environments Plappert et al. (2018): The Fetch robotic environments are physically plausible simulations based on the real Fetch robot. The purpose of these environments is to provide a platform to tackle problems which are close to practical, challenging robot manipulation tasks. Fetch is a 7-DoF robot arm with a two-finger gripper. The Fetch environments include four tasks: FetchReach, FetchPush, FetchPickAndPlace and FetchSlide. For all Fetch tasks, the length of each episode is 50. The desired goal, $dg$, is a three-dimensional coordinate which represents the target position. If a task has an object, the achieved goal $ag_t$ is a three-dimensional coordinate that represents the position of the object. Otherwise, $ag_t$ is a three-dimensional coordinate that represents the position of the gripper. Observations include the following information: position, velocity and state of the gripper. If a task has an object, the position, velocity and rotation information of the object is included. Therefore, the observation of FetchReach is a 10-dimensional vector. The observation of other tasks is a 25-dimensional vector. The action is a four-dimensional vector. The first three dimensions represent the relative position that the gripper needs to move in the next step. The last dimension indicates the distance between the fingers of the gripper. The reward function can be written as $r_t = -u\left(\lVert ag_t - dg \rVert_2 - \epsilon_g\right)$, where $u(\cdot)$ is the unit step function and $\epsilon_g$ is the distance threshold.

In the Fetch environments, for FetchReach, FetchPush and FetchPickAndPlace tasks, the agent is trained using 16 CPU cores. In each epoch, 50 episodes are collected for training. The FetchSlide task is more complex, so 32 CPU cores are used. In each epoch, 100 episodes are collected for training. The Message Passing Interface (MPI) framework is used to perform synchronization when updating the network. After each epoch, the agent is evaluated for 10 episodes by each MPI worker. Finally, the success rate of each MPI worker is averaged. During training, actions are sampled from multivariate normal distributions. In the evaluation phase, the mean vector of the distribution is used as an action.

The proposed method, termed PPO+ESIL, is compared with different baselines on different environments. All experiments are plotted based on five runs with different seeds. The solid line is the median value. The upper bound is the 75th percentile and the lower bound is the 25th percentile.
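
The summary statistics used for the plots can be sketched with the standard library. The function name `summarise_runs` and the example success rates are made up for illustration, and the linear-interpolation ("inclusive") quantile definition is an assumption, since the text does not state which one was used.

```python
import statistics


def summarise_runs(curves):
    """Per-epoch (25th percentile, median, 75th percentile) across seeds.

    `curves` holds one list of success rates per seed, all of equal length.
    """
    summary = []
    for values in zip(*curves):
        lower, median, upper = statistics.quantiles(values, n=4, method="inclusive")
        summary.append((lower, median, upper))
    return summary


# Five hypothetical seeds, two epochs each.
runs = [[0.0, 0.5], [0.2, 0.6], [0.4, 0.7], [0.6, 0.8], [0.8, 0.9]]
bands = summarise_runs(runs)
```

Using the median and quartiles instead of the mean and standard deviation keeps a single diverging seed from dominating the shaded band.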

5.2 Network Structure and Hyperparameters

Network structure: Both the actor network and the critic network have three hidden layers with 256 neurons each. ReLU is selected as the activation function for the hidden layers. In the grid-world environment, the actor network builds a categorical distribution. In the Fetch environment, the actor network builds normal distributions by producing mean vectors and the standard deviations of the independent variables.

Hyperparameters: For all experiments, the learning rate is 0.0003 for both the actor and critic networks. The discount factor is 0.98. Adam is chosen as the optimiser. For each epoch, the actor network and critic network are updated 10 times. The clip ratio of the PPO algorithm is 0.2. For the grid-world environment, the networks are trained for 100 epochs with a batch size of 160. Each epoch consists of 100 episodes. For the Fetch environments, the networks are trained for 100 epochs on the FetchReach task and for 1000 epochs on the other tasks, with a batch size of 125. For the FetchReach, FetchPush and FetchPickAndPlace tasks, each epoch consists of 50 episodes. For the FetchSlide task, each epoch consists of 100 episodes. In designing the experiments, the number of episodes within an epoch is a balance between being able to train, the length of time required to run experiments and the maximum number of time steps that would be required to achieve a goal. All environments have a fixed maximum number of time steps $T$, but this maximum differs depending on the problem or environment. This means that the number of state–action pairs can differ between two environments that have the same number of episodes and the same number of epochs. We arrange the episodes to try to compensate for the number of state–action pairs collected during training to make experiments easier to compare. The models are trained on a machine with an Intel i7-5960X CPU and 64GB RAM.
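
To make the point about differing numbers of state–action pairs concrete, the count per epoch is the number of episodes multiplied by the episode length. The sketch below uses the episode counts and lengths stated in the setup; it deliberately ignores how episodes are distributed over MPI workers, which is an assumption of this illustration, and the names `SETTINGS` and `transitions_per_epoch` are hypothetical.

```python
# Episodes per epoch and episode length, as stated in the setup description.
SETTINGS = {
    "EmptyRoom": {"episodes_per_epoch": 100, "horizon": 32},
    "FetchReach": {"episodes_per_epoch": 50, "horizon": 50},
    "FetchPush": {"episodes_per_epoch": 50, "horizon": 50},
    "FetchPickAndPlace": {"episodes_per_epoch": 50, "horizon": 50},
    "FetchSlide": {"episodes_per_epoch": 100, "horizon": 50},
}


def transitions_per_epoch(setting):
    """State-action pairs collected in one epoch (MPI workers not considered)."""
    return setting["episodes_per_epoch"] * setting["horizon"]


counts = {name: transitions_per_epoch(cfg) for name, cfg in SETTINGS.items()}
```

Under these assumptions an Empty Room epoch yields 3200 transitions, a FetchReach epoch 2500 and a FetchSlide epoch 5000, so equal epoch counts do not imply equal amounts of experience.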

5.3 Grid-World Environments

To understand the basic properties of the proposed method, the toy Empty Room environment is used to evaluate ESIL. The following baselines are considered:

  • PPO: vanilla PPO Schulman et al. (2017) for discrete action spaces;

  • PPO+SIL/PPO+SIL+HER: Self-imitation learning (SIL) is used with PPO to solve hard exploration environments by imitating past good experiences Oh et al. (2018). In order to solve sparse rewards tasks, hindsight experience replay (HER) is applied to sampled transitions;

  • DQN+HER: Hindsight experience replay (HER), designed for sparse reward problems, is combined with a deep Q-learning network (DQN) Andrychowicz et al. (2017); this is an off-policy algorithm;

  • Hindsight Policy Gradients (HPG): the vanilla implementation of HPG that is only suitable for discrete action spaces Rauber et al. (2019).

More specifically, PPO+ESIL is compared with the on-policy baselines above in Figure 4a. This shows that PPO+ESIL converges faster than the other four baselines, and PPO+SIL converges faster than vanilla PPO, because PPO+SIL reuses past good experiences to help exploration and training. Hindsight Policy Gradient (HPG) is slower than the others because its goal sampling is inefficient and unstable.

(a) Comparison with on-policy baselines (b) Ablation study of selection module (c) Variation of adaptive weight coefficient (d) Comparison with off-policy baselines
Figure 4: Results of the grid-world environment. (a) Comparing the performance of PPO+ESIL between the on-policy approaches. (b) An ablation study on the trajectory selection module. (c) The variation of adaptive weight coefficient through training. (d) Comparison of the performance of PPO+ESIL to an off-policy approach: DQN+HER.

Further, the performance of the trajectory selection module is evaluated in Figure 4b. This shows that the selection strategy helps improve the performance. Hindsight experiences are not always perfect; the trajectory selection module filters some undesirable, modified experiences. Through adopting this selection strategy, the chance of agents learning from poor trajectories is reduced. The adaptive weight coefficient $\beta$ is also investigated in these experiments. In Figure 4c, it can be seen that at the initial stages of training, $\beta$ is high. This is because at this stage, the agent very seldom achieves the original goals. The hindsight experiences can yield higher returns than the original experiences. Therefore, a large proportion of hindsight experiences are selected to conduct self-imitation learning, helping the agent learn a policy for moving through the room. In the later stages of training, the agent can achieve success frequently, and the hindsight experiences might be redundant. In this case, undesired hindsight experiences are removed by the trajectory selection module and the PPO loss $L_{ppo}$ leads the training. However, when the trajectory selection module is not employed, all hindsight experiences, including the redundant ones, are used throughout the entire training process. This leads to overfitting and makes training unstable. Thus, the ESIL loss $L_{esil}$ can provide the agent with a better initial policy, and the adaptive weight coefficient $\beta$ can balance the contributions of $L_{ppo}$ and $L_{esil}$ properly during training.

(a) FetchReach (b) FetchPush (c) FetchPickAndPlace (d) FetchSlide
Figure 5: Results of comparison between ESIL and on-policy baselines on all Fetch environments.

Finally, PPO+ESIL is also compared with DQN+HER, an off-policy RL algorithm, in Figure 4d. This shows that DQN+HER performs slightly better than PPO+ESIL at the start of training. However, the proposed method achieves similar results to DQN+HER later in training.

5.4 Continuous Environments

Continuous control problems are generally more challenging for reinforcement learning. In the experiments of this section, the aim is to investigate how useful the proposed method is for several hard exploration OpenAI Gym Fetch tasks. These environments are commonly used to assess the performance of RL methods for continuous control. The following baselines are considered:

  • PPO: the vanilla PPO Schulman et al. (2017) for continuous action spaces;

  • PPO+SIL/PPO+SIL+HER: self-imitation learning is used with PPO to solve hard exploration environments by imitating past good experiences Oh et al. (2018). For sparse-reward tasks, hindsight experience replay (HER) is applied to sampled transitions;

  • DDPG+HER: this is the state-of-the-art off-policy RL algorithm for the Fetch tasks. Deep deterministic policy gradient (DDPG) is trained with HER to deal with the sparse reward problem Andrychowicz et al. (2017).
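The hindsight relabelling that the HER-based baselines rely on can be sketched as follows. This is an illustrative example only: it assumes a sparse reward of 0 within a distance threshold of the goal and -1 otherwise, and it relabels with the goal achieved at the final step of the episode. The names `sparse_reward` and `relabel_episode` and the threshold value are hypothetical and not taken from the paper or from any library.

```python
import numpy as np

def sparse_reward(achieved_goal, goal, threshold=0.05):
    """Hypothetical sparse reward: 0 within `threshold` of the goal, else -1."""
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(goal))
    return 0.0 if distance <= threshold else -1.0

def relabel_episode(achieved_goals, original_goal, threshold=0.05):
    """Relabel an episode with the goal it actually reached at its final step.

    achieved_goals[t] is the goal achieved after step t.
    Returns (original_rewards, hindsight_goal, hindsight_rewards).
    """
    achieved_goals = np.asarray(achieved_goals, dtype=float)
    hindsight_goal = achieved_goals[-1]
    original = [sparse_reward(g, original_goal, threshold) for g in achieved_goals]
    hindsight = [sparse_reward(g, hindsight_goal, threshold) for g in achieved_goals]
    return original, hindsight_goal, hindsight

# A failed episode: the gripper never gets near the original goal.
achieved = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]
orig_r, new_goal, hind_r = relabel_episode(achieved, original_goal=[1.0, 1.0])
print(orig_r, hind_r)
```

The original episode carries no learning signal, whereas the relabelled one ends with a non-negative reward, which is why hindsight helps when successes are rare.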

(a) FetchReach (b) FetchPush (c) FetchPickAndPlace (d) FetchSlide
Figure 6: Results of ablation studies with or without using trajectory selection module on all Fetch environments.

5.4.1 Comparison to On-Policy Baselines

As shown in Figure 5, PPO+ESIL achieves reasonable results on all Fetch environments. In contrast, PPO, PPO+SIL and PPO+SIL+HER do not work on any task except FetchReach. In comparison with the other selected tasks from the Fetch environments, FetchReach is relatively simple, because there is no object to be manipulated. For the other tasks, it is quite difficult for the agent to collect sufficient positive rewards during exploration, because they occur so rarely. Although PPO+SIL utilises past good experiences to help exploration, it still faces the difficulty that past experiences rarely contain positive rewards. From the experiments (see Figure 5), PPO+SIL (no hindsight) converges much more slowly than PPO alone. Attempting to use only the original trajectories for self-imitation learning leads to unsatisfactory performance. For PPO+SIL+HER (with no episodic update), the sampled transitions are modified into hindsight experiences, achieving better performance in the FetchReach and FetchSlide tasks. However, this transition-based method still cannot solve the other two manipulation tasks. In contrast, the proposed PPO+ESIL, by utilising episodic hindsight experiences from failed trajectories, can achieve positive rewards quickly at the start of training.

5.4.2 Ablation Study of Trajectory Selection Module

In order to investigate the effect of trajectory selection, ablation studies are performed to validate the selection strategy of our approach. As shown in Figure 6, when the trajectory selection module is not used, the performance of the agent increases at first and then starts to decrease. This suggests that the agent starts to converge to a sub-optimal location. However, as shown in Figure 6d, for the FetchSlide task the agent converges faster and performs better without the trajectory selection module. This is likely to be because FetchSlide is the most difficult of the Fetch environments: during training, the agent is very unlikely to achieve positive rewards. Figure 7 also indicates that the value of the adaptive weight coefficient in FetchSlide is higher than in the other environments, which means that the majority of hindsight experiences have higher returns than the original experiences. Thus, using more hindsight experiences (without filtering) accelerates training at this stage. Nonetheless, the trajectory selection module prevents the agent from overfitting to the hindsight experiences in the other three tasks. Figure 7 shows the adaptive weight coefficient on all Fetch environments. When the trajectory selection module is used, the value of the coefficient decreases as the number of training epochs increases. This implies that the agent can achieve a greater proportion of the original goals in the later stages of training, and fewer hindsight experiences are required for self-imitation learning.

(a) FetchReach (b) FetchPush (c) FetchPickAndPlace (d) FetchSlide
Figure 7: Variation in adaptive weight coefficient through the training on all Fetch environments.

5.4.3 Comparison to Off-Policy Baselines

Finally, the proposed method is also compared with a state-of-the-art off-policy algorithm: DDPG+HER. From Figure 8, it may be seen that DDPG+HER converges faster than PPO+ESIL in all tasks. This is because DDPG+HER is an off-policy algorithm that uses a large number of hindsight experiences and employs a replay buffer to store samples collected in the past, which gives it better sample efficiency than on-policy algorithms such as PPO. Nevertheless, PPO+ESIL eventually obtains a similar performance to DDPG+HER, and Figure 8c shows that PPO+ESIL outperforms DDPG+HER in the FetchPickAndPlace task, with a success rate close to 1. This suggests that PPO+ESIL shares the characteristics of on-policy algorithms, which have low sample efficiency but are able to obtain a comparable performance to off-policy algorithms in continuous control tasks Schulman et al. (2017).

(a) FetchReach (b) FetchPush (c) FetchPickAndPlace (d) FetchSlide
Figure 8: Results of comparison between PPO+ESIL and DDPG+HER on all Fetch environments.

5.5 Overall Performance

Table 1 shows the average success rate over the last 10 epochs of training for the baseline methods and PPO+ESIL. The proposed ESIL achieves the best performance in four out of five tasks. However, PPO and PPO+SIL only obtain reasonable results for the Empty Room and FetchReach tasks. With the assistance of HER, PPO+SIL+HER obtains a better performance in the FetchSlide task. The off-policy methods (DQN+HER on Empty Room and DDPG+HER on the Fetch tasks) solve all five tasks between them, but obtain a better performance than PPO+ESIL only in the FetchPush task.

Method               Empty Room       Reach            Push             Pick             Slide
PPO                  1.000 ± 0.000    1.000 ± 0.000    0.070 ± 0.001    0.033 ± 0.001    0.077 ± 0.001
PPO + SIL            0.998 ± 0.002    0.225 ± 0.016    0.071 ± 0.001    0.036 ± 0.002    0.011 ± 0.001
PPO + SIL + HER      0.996 ± 0.013    1.000 ± 0.000    0.066 ± 0.011    0.035 ± 0.004    0.276 ± 0.011
DQN + HER            1.000 ± 0.000    -                -                -                -
DDPG + HER           -                1.000 ± 0.000    0.996 ± 0.001    0.888 ± 0.008    0.733 ± 0.013
HPG                  0.964 ± 0.012    -                -                -                -
PPO + ESIL (Ours)    1.000 ± 0.000    1.000 ± 0.000    0.984 ± 0.003    0.986 ± 0.002    0.812 ± 0.015
Table 1: Average success rate ± standard error in the last 10 epochs over five random seeds on all environments (bold indicates the best result among all methods).
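As an illustration of how entries of this kind can be computed, the sketch below averages each seed's success rate over its last 10 epochs and then reports the mean and standard error across seeds. The exact aggregation used for Table 1 is not specified here, so the order of averaging, the use of the sample standard deviation, the function name `summarize` and the synthetic curves are all assumptions.

```python
import numpy as np

def summarize(success, last_k=10):
    """Mean and standard error of the final success rate across seeds.

    success has shape (num_seeds, num_epochs) and holds per-epoch success rates.
    """
    success = np.asarray(success, dtype=float)
    # Average the last `last_k` epochs separately for each seed.
    per_seed = success[:, -last_k:].mean(axis=1)
    mean = per_seed.mean()
    # Standard error of the mean across seeds, using the sample std (ddof=1).
    sem = per_seed.std(ddof=1) / np.sqrt(per_seed.size)
    return mean, sem

# Synthetic learning curves for five seeds and 50 epochs (illustrative only).
rng = np.random.default_rng(0)
curves = np.clip(rng.normal(0.8, 0.05, size=(5, 50)), 0.0, 1.0)
mean, sem = summarize(curves)
print(f"{mean:.3f} ± {sem:.3f}")
```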

6 Conclusions

This paper proposed a novel method for self-imitation learning (SIL), in which an on-policy RL algorithm uses episodic modified past trajectories, i.e., hindsight experiences, to update policies. Compared with standard self-imitation learning, episodic self-imitation learning (ESIL) has a better performance in continuous control tasks where rewards are sparse. As far as we know, it is also the first time that hindsight experiences have been combined with state-of-the-art on-policy RL algorithms, such as PPO, to solve relatively hard exploration environments in continuous action spaces.

The experiments that we have conducted suggest that simply using self-imitation learning with the PPO algorithm, even with hindsight experience, leads to disappointing performance in continuous control Fetch tasks. In contrast, the episodic approach we take with ESIL is able to learn in these sparse reward settings. The auxiliary trajectory selection module and the adaptive weight help the training process to remove undesired experiences and balance the contributions to learning between the PPO term and the ESIL term automatically, and also increase the stability of training.

Our experiments suggest that the selection module is useful for preventing overfitting to sub-optimal hindsight experiences, but also that it does not always lead to learning a better policy faster. Despite this, selection filtering appears to support learning a useful policy in challenging environments. The experiments we have conducted to date have utilised relatively small networks, and it would be appropriate to extend them to more complex observation spaces and, consequently, to more elaborate actor/critic networks.

Future work includes extending the proposed method to support hierarchical reinforcement learning (HRL) algorithms for more complex manipulation control tasks, such as in-hand manipulation. Episodic self-imitation learning (ESIL) can also be applied to simultaneously learn sub-goal policies.

This work was partly supported by the Engineering and Physical Sciences Research Council [grant number: EP/J021199/1].


  • P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, pp. 1. Cited by: §2.
  • M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058. Cited by: §1, §2, §4, 3rd item, 3rd item.
  • O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §3.2.
  • K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath (2017) Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine 34 (6), pp. 26–38. Cited by: §1.
  • C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §3.2.
  • M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §2.
  • Y. Ding, C. Florensa, P. Abbeel, and M. Phielipp (2019) Goal-conditioned imitation learning. In Advances in Neural Information Processing Systems, pp. 15324–15335. Cited by: §2.
  • B. Fang, S. Jia, D. Guo, M. Xu, S. Wen, and F. Sun (2019a) Survey of imitation learning for robotic manipulation. International Journal of Intelligent Robotics and Applications, pp. 1–8. Cited by: §2.
  • M. Fang, C. Zhou, B. Shi, B. Gong, J. Xu, and T. Zhang (2019b) DHER: hindsight experience replay for dynamic goals. In International Conference on Learning Representations, Cited by: §2.
  • M. Fang, T. Zhou, Y. Du, L. Han, and Z. Zhang (2019c) Curriculum-guided hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 12623–12634. Cited by: §2.
  • C. Finn, S. Levine, and P. Abbeel (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58. Cited by: §2.
  • C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine (2017) One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, Vol. 78, pp. 357–368. Cited by: §2.
  • C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel (2017) Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300. Cited by: §1.
  • T. Gangwani, Q. Liu, and J. Peng (2018) Learning self-imitating diverse policies. arXiv preprint arXiv:1805.10309. Cited by: §2.
  • Y. Gao, J. Lin, F. Yu, S. Levine, T. Darrell, et al. (2018) Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313. Cited by: §1.
  • Y. Guo, J. Choi, M. Moczulski, S. Bengio, M. Norouzi, and H. Lee (2019) Self-imitation learning via trajectory-conditioned policy for hard-exploration tasks. arXiv, pp. arXiv–1907. Cited by: §2.
  • T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. (2018) Deep q-learning from demonstrations. In AAAI Conference on Artificial Intelligence, Cited by: §1.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573. Cited by: §2, §2.
  • A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne (2017) Imitation learning: a survey of learning methods. ACM Computing Surveys 50 (2), pp. 21. Cited by: §2.
  • S. James, M. Bloesch, and A. J. Davison (2018) Task-embedded control networks for few-shot imitation learning. Conference on Robot Learning. Cited by: §1.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §1.
  • S. Y. Lee, C. Sungik, and S. Chung (2019) Sample-efficient deep reinforcement learning via episodic backward update. In Advances in Neural Information Processing Systems, pp. 2112–2121. Cited by: §1, §2, §4.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
  • H. Liu, A. Trott, R. Socher, and C. Xiong (2019a) Competitive experience replay. In International Conference on Learning Representations, Cited by: §1, §2.
  • N. Liu, T. Lu, Y. Cai, B. Li, and S. Wang (2019b) Hindsight generative adversarial imitation learning. arXiv preprint arXiv:1903.07854. Cited by: §2.
  • W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi (2017) A survey of deep neural network architectures and their applications. Neurocomputing 234, pp. 11–26. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1, §1.
  • A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Overcoming exploration in reinforcement learning with demonstrations. In IEEE International Conference on Robotics and Automation, pp. 6292–6299. Cited by: §1, §2.
  • A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In International Conference on Machine Learning, Vol. 99, pp. 278–287. Cited by: §1.
  • A. Y. Ng and S. J. Russell (2000) Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pp. 663–670. Cited by: §2.
  • J. Oh, Y. Guo, S. Singh, and H. Lee (2018) Self-imitation learning. In International Conference on Machine Learning, Cited by: §1, §2, §4.2, 2nd item, 2nd item.
  • M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, et al. (2018) Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464. Cited by: §1, §3.3, §5.1.
  • A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2017) Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087. Cited by: §1, §2.
  • P. Rauber, A. Ummadisingu, F. Mutz, and J. Schmidhuber (2019) Hindsight policy gradients. In International Conference on Learning Representations, Cited by: §2, 4th item.
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pp. 627–635. Cited by: §2.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3.2, §4.1, 1st item, 1st item, §5.4.3.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction 2nd ed. MIT press. Cited by: §3.1.
  • Y. Tang (2020) Self-imitation learning via generalized lower bound q-learning. In Advances in Neural Information Processing Systems, Cited by: §2.
  • F. Torabi, G. Warnell, and P. Stone (2018) Behavioral cloning from observation. arXiv preprint arXiv:1805.01954. Cited by: §2.
  • M. Večerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller (2017) Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817. Cited by: §1, §2.
  • Z. Wang, J. S. Merel, S. E. Reed, N. de Freitas, G. Wayne, and N. Heess (2017) Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pp. 5320–5329. Cited by: §2.
  • Y. Wu, N. Charoenphakdee, H. Bao, V. Tangkaratt, and M. Sugiyama (2019) Imitation learning from imperfect demonstration. In International Conference on Machine Learning, Cited by: §2.
  • H. Xu, Y. Gao, F. Yu, and T. Darrell (2017) End-to-end learning of driving models from large-scale video datasets. In IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2174–2182. Cited by: §2.
  • T. Zhang, Z. McCarthy, O. Jowl, D. Lee, X. Chen, K. Goldberg, and P. Abbeel (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In IEEE International Conference on Robotics and Automation, pp. 1–8. Cited by: §2.
  • R. Zhao and V. Tresp (2018) Energy-based hindsight experience prioritization. arXiv preprint arXiv:1810.01363. Cited by: §2.
  • B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, Vol. 8, pp. 1433–1438. Cited by: §2.