Hindsight Generative Adversarial Imitation Learning

03/19/2019 ∙ by Naijun Liu, et al.

Compared to reinforcement learning, imitation learning (IL) is a powerful paradigm for training agents to learn control policies efficiently from expert demonstrations. However, obtaining demonstration data is often costly and laborious, which poses a significant challenge in many scenarios. A promising alternative is to train agents via imitation learning without expert demonstrations, which would greatly expand the range of settings to which imitation learning applies. To this end, in this paper we propose the Hindsight Generative Adversarial Imitation Learning (HGAIL) algorithm, which aims to achieve imitation learning without any demonstrations. By combining the hindsight idea with the generative adversarial imitation learning (GAIL) framework, we successfully perform imitation learning in cases where expert demonstration data are not available. Experiments show that the proposed method trains policies whose performance is comparable to current imitation learning methods. Furthermore, HGAIL essentially embodies a curriculum learning mechanism, which is critical for learning policies.




I Introduction

Reinforcement learning (RL) has emerged as a promising method for solving complex decision-making tasks, such as video games [1], robot manipulation [2][3], and autonomous driving [4]. However, devising appropriate reward functions can be quite challenging for many applications [5]. Inverse reinforcement learning (IRL) [6] addresses the problem of learning reward functions from demonstration data and is often considered a branch of imitation learning (IL) [7]. Instead of learning reward functions, other imitation learning methods have been proposed to learn a policy directly from expert demonstrations.

Prior works addressed the IL problem via behavior cloning (BC), which reduces learning a policy from expert demonstrations to supervised learning [8]. However, covariate shift often gives rise to compounding errors [9]. To overcome the drawbacks of BC, the generative adversarial imitation learning (GAIL) algorithm [10] was proposed based on the formulation of generative adversarial networks (GAN) [11], where the generator is trained to generate expert-like samples and the discriminator is trained to distinguish between generated and real expert samples. GAIL is an appealing, highly effective, and efficient learning framework for policy learning with an unknown reward.

Fig. 1: The illustration of hindsight generative adversarial imitation learning (HGAIL) algorithm. Rolled-out trajectories are produced via policy (generator) G interacting with the environment. Expert-like demonstration data is converted from rolled-out trajectories with hindsight transformation technique. Regarding rolled-out trajectories as negative samples and treating expert-like demonstration data as positive samples, the discriminator is trained to distinguish between expert-like samples and negative samples.

Inevitably, the imitation learning paradigm requires demonstration data, usually of high quality [12]. However, gathering enough high-quality expert demonstrations is often costly and difficult. To this end, some methods have been proposed that train policies with as few demonstrations as possible, even reducing the requirement to a single demonstration [13].

Taking this a step further and pursuing an alternative learning paradigm, in this paper we consider whether an imitation learning algorithm can be employed successfully with no demonstrations available, while the final learned policy achieves performance comparable to current imitation learning algorithms, which would be highly promising.

A feasible way to solve this problem is to make the algorithm intelligently self-synthesize expert-like demonstration data. To do so, we propose the hindsight generative adversarial imitation learning (HGAIL) algorithm, which combines the idea of hindsight, inspired by psychology [14] and hindsight experience replay (HER) [15], with GAIL in a unified learning framework. In the adversarial training process, as illustrated in Figure 1, rolled-out trajectories are generated by the policy G interacting with the environment. Expert-like samples are converted from the rolled-out trajectories via hindsight transformation, while the rolled-out trajectories themselves are treated directly as negative samples without any change, which satisfies the requirements for training the discriminator and the generator. Our experimental results show that HGAIL allows agent training to proceed smoothly with no demonstration data provided. The final learned policy shows performance comparable to current imitation learning methods.

Furthermore, our HGAIL algorithm essentially embodies a curriculum learning mechanism in the adversarial learning procedure. At different optimization steps, expert-like demonstration data are synthesized from rolled-out trajectories of different quality levels. Therefore, the rolled-out trajectories and the self-synthesized expert-like data are always appropriately matched for adversarial training, which makes adversarial policy learning stable and efficient. As shown in our experiments, this curriculum mechanism is crucial for improving the performance of the final learned policy.

In summary, our main contribution is a method for achieving imitation learning with no demonstration data available. Expert-like demonstration data are self-synthesized with the hindsight transformation mechanism under the proposed HGAIL learning framework. In addition, our method dynamically transforms the rolled-out trajectory data into expert-like data during training, which ensures that the hindsight-transformed data are at the appropriate level for adversarial policy learning. To some extent, this latent learning mechanism automatically forms a curriculum, which greatly benefits the performance of the learned policy. Our proposed HGAIL algorithm is also sample-efficient, as we only need rolled-out trajectories generated by the agent interacting with the environment; no demonstration data or any other external data are required.

II Related Work

II-A Imitation Learning

Imitation learning algorithms can be classified into three broad categories: behavior cloning (BC), inverse reinforcement learning (IRL), and generative adversarial imitation learning (GAIL).

Behavior cloning reduces imitation learning to supervised learning, which is simple and easy to implement [8]. However, BC requires a huge amount of high-quality expert demonstrations [12].

Inverse reinforcement learning addresses the imitation learning problem by inferring a reward function from demonstration data and then using the learned reward function to train a policy. Prior works in IRL include maximum-margin [16][17] and maximum-entropy [18][19][20] formulations.

Generative adversarial imitation learning (GAIL) [10] is a recent imitation learning method inspired by generative adversarial networks (GAN) [11]. A similar framework called guided cost learning (GCL) has also been proposed [21] for inverse reinforcement learning. As training GAIL is notoriously unstable, many works focus on improving stability and robustness by learning semantic policy embeddings [22], via kernel mean embedding [23], or by enforcing an information bottleneck to constrain information flow in the discriminator [24]. More recent works extend the framework by learning robust rewards from states only [25] or from state-action pairs [26] in transfer settings for new policy learning.

Other works deal with actions not being available in the demonstration data [27], or with capturing the latent structure underlying expert demonstrations [28] for imitation learning.

II-B Hindsight Experience Replay

Hindsight experience replay (HER) [15] was proposed to deal with sparse rewards in reinforcement learning. The key insight of HER is that even failed rollouts, in which no valuable reward was obtained, can be transformed into successful ones by assuming that a state observed in the rollout was the actual goal. Recent works have improved the performance of HER by rewarding hindsight experiences more [29], combining curiosity and prioritization mechanisms [30], or computing trajectory energy based on the work-energy principle in physics [31]. An extension of HER called dynamic hindsight experience replay (DHER) [32] was proposed to deal with dynamic goals.

II-C Learning with Few Data

Generally, training policies with imitation learning requires expert demonstration data, often in large quantity or of high quality, and in some training scenarios obtaining expert demonstrations is not easy [12]. Much work has emerged to make imitation learning algorithms work well with fewer demonstrations [33], such as meta-learning frameworks [34][35][36], neural task programming [37], and combining reinforcement learning with imitation learning [39]. Zero-shot learning has been proposed to address visual demonstrations without actions [40].

In recent works, self-imitation learning methods [41][42] have been proposed to train policies to reproduce the agent's past good experience without external demonstrations. [43] proposes the Generative Adversarial Self-Imitation Learning (GASIL) method, which encourages the agent to imitate its past good trajectories via the generative adversarial imitation learning framework.

Instead of choosing the top-K trajectories by episode return as the positive samples, as in GASIL [43], we employ the hindsight idea to directly transform the generator's data into expert-like demonstration data. Experiments show that our proposed method outperforms GASIL in robot reaching and grasping scenarios (Section IV-A).

III Method

To outline our method, we first consider the standard GAIL learning framework, consisting of a policy (generator) $\pi_\theta$ and a discriminator $D_w$, parameterized by $\theta$ and $w$ respectively. The goal of the policy is to generate rolled-out trajectories similar to the demonstration trajectories, while the discriminator distinguishes between state-action pairs sampled from the expert demonstration trajectories and from the generator's trajectories. The generator and the discriminator are optimized with the following objective function:

$$\min_{\theta}\,\max_{w}\;\; \mathbb{E}_{\pi_E}\big[\log D_w(s, a)\big] + \mathbb{E}_{\pi_\theta}\big[\log\big(1 - D_w(s, a)\big)\big] - \lambda H(\pi_\theta), \qquad (1)$$

where $\pi_E$ is the expert policy, $H(\pi_\theta)$ is the causal entropy of the policy, which encourages sufficient exploration of the action space, and $\lambda$ is the regularization weight.

In the concrete implementation of the proposed HGAIL approach, the policy $\pi_\theta$ and the discriminator $D_w$ are represented by multi-layer neural networks. The output of the policy network parameterizes the Gaussian policy

$$\pi_\theta(a \mid s \oplus g) = \mathcal{N}\big(\mu_\theta(s \oplus g), \Sigma\big),$$

where $\mu_\theta$ is the mean, $\Sigma$ is the covariance, and $\oplus$ denotes concatenation. At the beginning of each episode, the agent samples a goal $g$ and an initial state $s_0$. At time step $t$, the agent takes an action $a_t$ sampled from the Gaussian policy based on the current policy $\pi_\theta$, state $s_t$, and goal $g$. The agent then moves to the next state $s_{t+1}$ according to the transition dynamics $p(s_{t+1} \mid s_t, a_t)$ and receives a reward given by the discriminator. At the end of each episode, a trajectory $\tau = (s_0 \oplus g, a_0, s_1 \oplus g, a_1, \dots, s_T \oplus g)$ is generated, where $T$ is the length of the trajectory. Repeating the above procedure $N$ times, we obtain rolled-out trajectories $\mathcal{T} = \{\tau_i\}_{i=1}^{N}$.
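As a concrete sketch of this rollout procedure, the following Python snippet rolls out one episode with a goal-conditioned Gaussian policy; `policy_mean` and `env_step` are hypothetical stand-ins for the policy network and the environment dynamics:

```python
import numpy as np

def rollout(policy_mean, env_step, s0, goal, T, sigma=1.0, rng=None):
    """Roll out one episode with a Gaussian policy conditioned on state||goal.

    policy_mean: maps the concatenated [state, goal] vector to the mean action
                 (stands in for the policy network mu_theta).
    env_step:    transition function s_{t+1} = env_step(s_t, a_t)
                 (stands in for the environment dynamics).
    Returns the trajectory as a list of (state||goal, action) pairs.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    traj, s = [], np.asarray(s0, dtype=float)
    for t in range(T):
        sg = np.concatenate([s, goal])          # state || goal
        a = rng.normal(policy_mean(sg), sigma)  # sample from N(mu, sigma^2 I)
        traj.append((sg, a))
        s = env_step(s, a)                      # transition dynamics
    return traj
```

For instance, with a 1-D point mass whose dynamics are `s + a`, `rollout` returns `T` (state||goal, action) pairs that can later be relabeled by the hindsight transformation.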

To train the discriminator without expert demonstrations, our method leverages the hindsight transformation technique (Algorithm 1) to convert the rolled-out trajectories $\mathcal{T}$ into expert-like trajectories $\mathcal{T}'$.

0:  Input: rolled-out trajectories $\mathcal{T} = \{\tau_i\}_{i=1}^{N}$; hindsight transformation probability $p$; time index set $I$ for hindsight transformation
0:  Output: hindsight-transformed trajectories $\mathcal{T}' = \{\tau'_i\}_{i=1}^{N}$
1:  for each trajectory $\tau$ in $\mathcal{T}$ do
2:     Set $I \leftarrow \emptyset$
3:     for each time step $t$ in $\tau$ do
4:         Append $t$ to $I$ with probability $p$
5:     end for
6:     for $t$ in $I$ do
7:        Randomly sample an achieved position from time step $t$ to $T$ in $\tau$
8:        Set the new goal of state $s_t$ to be the sampled position
9:     end for
10:     $\tau' \leftarrow$ the hindsight-transformed trajectory
11:  end for
12:  return $\mathcal{T}'$
Algorithm 1 Synthesizing expert-like demonstrations with hindsight transformation

More specifically, the detailed steps for self-synthesizing expert-like trajectories from rolled-out trajectories are as follows. First, for each trajectory $\tau$ in $\mathcal{T}$, each time step $t$ is chosen with probability $p$ for hindsight transformation, where $0 < p \le 1$; all chosen time steps are appended to the index set $I$. Second, for every time step $t$ in $I$, we set the new goal of state $s_t$ to the position achieved at state $s_{t'}$, where $t'$ is randomly chosen from time step $t$ to $T$ in $\tau$. In other words, we randomly replace the goal of state $s_t$ with a position achieved at or after observing state $s_t$. This transforms the rolled-out trajectory $\tau$ into an expert-like trajectory $\tau'$. The procedure is repeated until all trajectories are transformed.
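The relabeling loop above can be sketched in Python; the trajectory representation (dicts with `state`, `goal`, and `achieved` entries) is a hypothetical stand-in for the paper's trajectory format:

```python
import random

def hindsight_transform(trajectory, p=1.0, rng=None):
    """Relabel goals in one rolled-out trajectory (the "future" strategy).

    trajectory: list of dicts with keys 'state', 'goal', 'achieved',
                where 'achieved' is the position actually reached at that step.
    Each step is selected with probability p; its goal is replaced by the
    achieved position of a randomly chosen step at or after it.
    """
    if rng is None:
        rng = random.Random(0)
    T = len(trajectory)
    # choose which time steps to relabel
    idx = [t for t in range(T) if rng.random() < p]
    out = [dict(step) for step in trajectory]  # copy; negatives stay unchanged
    for t in idx:
        future = rng.randrange(t, T)           # sample t' in [t, T-1]
        out[t]['goal'] = trajectory[future]['achieved']
    return out
```

The original trajectory is left untouched, matching the paper's use of the unmodified rollouts as negative samples for the discriminator.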

The policy (generator) $\pi_\theta$ is optimized with the policy-gradient method proximal policy optimization (PPO) [44]. The objective function is

$$\mathcal{L}(\theta) = \mathbb{E}_{\pi_\theta}\big[\log \pi_\theta(a \mid s \oplus g)\, Q(s, a)\big]. \qquad (2)$$

The gradient is given by

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s \oplus g)\, Q(s, a)\big], \qquad (3)$$

where $Q(s, a)$ is the action-value function, estimated with the reward $r(s, a)$ output by the discriminator.

The discriminator is optimized by minimizing the following cross-entropy loss:

$$\mathcal{L}(w) = -\,\mathbb{E}_{(s, a) \sim \mathcal{T}'}\big[\log D_w(s, a)\big] - \mathbb{E}_{(s, a) \sim \mathcal{T}}\big[\log\big(1 - D_w(s, a)\big)\big]. \qquad (4)$$

The gradient is given by

$$\nabla_w \mathcal{L}(w) = -\,\mathbb{E}_{(s, a) \sim \mathcal{T}'}\big[\nabla_w \log D_w(s, a)\big] - \mathbb{E}_{(s, a) \sim \mathcal{T}}\big[\nabla_w \log\big(1 - D_w(s, a)\big)\big]. \qquad (5)$$
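As a toy illustration of this cross-entropy update, consider a linear discriminator $D_w(x) = \sigma(w^\top x)$ over state-action features; one ascent step on the log-likelihood of positives (expert-like pairs) and negatives (rolled-out pairs) looks like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disc_grad_step(w, X_pos, X_neg, lr=0.1):
    """One cross-entropy gradient step for a linear discriminator
    D_w(x) = sigmoid(w @ x), with X_pos the expert-like (positive)
    features and X_neg the rolled-out (negative) features."""
    grad = np.zeros_like(w)
    for x in X_pos:                  # push log D(x) up on positives
        grad += (1.0 - sigmoid(w @ x)) * x
    for x in X_neg:                  # push log(1 - D(x)) up on negatives
        grad += -sigmoid(w @ x) * x
    return w + lr * grad / (len(X_pos) + len(X_neg))
```

In the paper the discriminator is a neural network rather than a linear model, but the loss and update direction are the same.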
0:   Input: policy (generator) $\pi_\theta$, discriminator $D_w$
1:  Initialize $\pi_\theta$, $D_w$ with random weights $\theta$, $w$
2:  Run $\pi_\theta$, generating rolled-out trajectories $\mathcal{T}$
3:  Synthesize expert-like demonstration data $\mathcal{T}'$ from $\mathcal{T}$ (Algorithm 1)
4:  Pre-train $\pi_\theta$ using MLE on $\mathcal{T}'$
5:  Pre-train $D_w$ by minimizing the cross entropy between $\mathcal{T}$ and $\mathcal{T}'$
6:  repeat
7:     for $g\_steps$ do
8:        Run policy $\pi_\theta$, generating rolled-out trajectories $\mathcal{T}$
9:        Update the policy parameter: $\theta \leftarrow \theta + \alpha \nabla_\theta \mathcal{L}(\theta)$, where $\nabla_\theta \mathcal{L}(\theta)$ is given in equation (3)
10:     end for
11:     for $d\_steps$ do
12:        Use the current $\pi_\theta$ to generate rolled-out trajectories $\mathcal{T}$
13:        Synthesize expert-like demonstration data $\mathcal{T}'$ from $\mathcal{T}$ (Algorithm 1)
14:        Update the discriminator parameter: $w \leftarrow w - \beta \nabla_w \mathcal{L}(w)$, where $\nabla_w \mathcal{L}(w)$ is given in equation (5)
15:     end for
16:  until HGAIL converges
Algorithm 2 Hindsight generative adversarial imitation learning

The fully detailed HGAIL algorithm is shown in Algorithm 2. At the beginning of training, we generate trajectories $\mathcal{T}$ using the policy $\pi_\theta$ with random weights. Expert-like demonstration data $\mathcal{T}'$ are synthesized from $\mathcal{T}$ with the hindsight transformation technique, as shown in Algorithm 1. $\mathcal{T}$ is regarded as the negative samples and $\mathcal{T}'$ as the positive samples. We use maximum likelihood estimation (MLE) to pre-train the policy $\pi_\theta$ on $\mathcal{T}'$, and we pre-train the discriminator $D_w$ by minimizing the cross entropy between $\mathcal{T}$ and $\mathcal{T}'$. We found this pre-training procedure beneficial for policy training. After pre-training, the policy and the discriminator are optimized by alternating between policy-gradient steps that increase (2) with respect to the policy parameter $\theta$ and gradient steps that decrease (4) with respect to the discriminator parameter $w$. Finally, the policy and the discriminator both converge.
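The alternating loop of Algorithm 2 can be summarized schematically in Python; the four callables are hypothetical stand-ins for rollout, hindsight transformation (Algorithm 1), the PPO update, and the discriminator update:

```python
def hgail_train(rollout_fn, transform_fn, policy_update, disc_update,
                iters=3, g_steps=2, d_steps=2):
    """Schematic HGAIL training loop: alternate g_steps policy updates
    with d_steps discriminator updates, per iteration.

    rollout_fn():        generate rolled-out trajectories with the current policy
    transform_fn(T):     synthesize expert-like data from T (Algorithm 1)
    policy_update(T):    one PPO step, eq. (3)
    disc_update(T, T'):  one cross-entropy step, eq. (5)
    """
    history = []
    for it in range(iters):
        for _ in range(g_steps):
            T = rollout_fn()                   # negatives from current policy
            policy_update(T)                   # improve the generator
        for _ in range(d_steps):
            T = rollout_fn()
            T_pos = transform_fn(T)            # self-synthesized positives
            disc_update(T, T_pos)              # improve the discriminator
            history.append((it, len(T_pos)))
    return history
```

This mirrors the structure of Algorithm 2 (minus the pre-training stage); the real updates operate on neural-network parameters rather than opaque callables.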

IV Experiments and Results

In this section, our goal is to test whether policies learned via our proposed HGAIL method work well without external demonstrations. In addition, ablation studies are conducted to show the influence of different mechanisms and hyper-parameters on policy learning. Finally, experiments are carried out to test whether the final learned policies can be transferred directly to a real-world physical system.

IV-A Policy Learning

To test the feasibility of the proposed HGAIL method, experiments are carried out on two common robotic tasks in the gym [46] environment: reaching a target position and grasping a target object [45] (as shown in Figure 2). To make these two tasks more challenging, we assume that only a binary sparse reward is available. For the reaching task, the reward is -1 for most states and 0 only when the robot gripper reaches the target position. Similarly, for the grasping task, the reward is -1 for most states and 0 only when the robot gripper succeeds in grasping the target object.

We compare our proposed HGAIL algorithm against the following methods: (1) GAIL [10] with demonstrations available, denoted GAIL-demo; (2) PPO [44], a state-of-the-art policy gradient method; (3) GASIL [43]; (4) HGAIL without the hindsight transformation technique, denoted HGAIL-no.

Fig. 2: Two tasks implemented on the Fetch robot in gym. The Fetch robot has seven degrees of freedom. In our experiments, the robot takes a four-dimensional action vector as input: the first three elements move the end-effector (gripper) along three orthogonal directions, and the fourth controls whether the gripper is closed or open. Left: reaching task. The red point denotes the target position within the robot workspace; the fourth element of the action vector is held fixed. Right: grasping object task. The black cube is the target object to be picked. Best viewed in color.

The performance of learned policies is measured with two metrics: distance error and success rate. Distance error is the distance between the target position and the gripper position at the end of each episode:

$$d_e = \big\lVert p_{\text{target}} - p_{\text{gripper}} \big\rVert_2 .$$

Success rate is the fraction of episodes in which the gripper reaches the target position within an allowed error $\epsilon$ (for the reaching task) or grasps the desired object (for the grasping task):

$$\text{success rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[d_e^{(i)} < \epsilon\big],$$

where $\mathbb{1}[\cdot]$ is the indicator function, which outputs 1 when its argument is true and 0 otherwise.
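Both metrics are straightforward to compute; a minimal Python version (with `eps` as the allowed error threshold) is:

```python
import numpy as np

def distance_error(p_target, p_gripper):
    """Euclidean distance between target and gripper at episode end."""
    return float(np.linalg.norm(np.asarray(p_target) - np.asarray(p_gripper)))

def success_rate(errors, eps):
    """Fraction of episodes whose final distance error is within eps."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean(errors < eps))
```

For example, `distance_error([0, 0, 0], [3, 4, 0])` is 5.0, and three episodes with errors `[0.01, 0.2, 0.03]` against `eps=0.05` give a success rate of 2/3.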

Implementation details are given in Appendix VI-A. Learning curves for policies trained with HGAIL and the above-mentioned methods on the robot reaching and grasping tasks are shown in Figure 3, and Table I summarizes the performance of the final learned policies.

Fig. 3: Learning curves of policies learned with HGAIL, GAIL with demonstration data provided (GAIL-demo), PPO, GASIL, and HGAIL without hindsight transformation (HGAIL-no). All policies use the same network architecture and the same hyper-parameters. The first row shows the success rates (left) and distance errors (right) of policies learned for the reaching task; the second row shows the same for the grasping object task. Compared to the other algorithms, HGAIL shows promising performance with no demonstrations provided. Best viewed in color.

Compared to GAIL with demonstrations available, our HGAIL algorithm shows comparable performance in terms of success rate and final distance error on both the reaching and the grasping tasks, although the policies trained with our method for the grasping task converge more slowly. Even without demonstration data, the performance of policies trained with HGAIL is thus promising. Compared to PPO, our algorithm performs much better: although the policy component of our algorithm is optimized via PPO, outside the HGAIL framework PPO alone cannot train successful policies for tasks with binary sparse rewards. Policies trained with GASIL show slower optimization and poorer final performance on both tasks. In comparison with HGAIL-no, HGAIL exhibits better performance, which indicates that hindsight transformation is a crucial ingredient of HGAIL when demonstrations are not available. These results show that HGAIL can work well without demonstration data and learn successful policies. We can also see that, because our algorithm essentially embodies a curriculum learning mechanism, HGAIL optimizes faster than GAIL with demonstrations during the early period of training.

Method | Reaching task (success rate / distance error) | Grasping object task (success rate / distance error)

TABLE I: Performance of policies trained with different algorithms

IV-B Ablation Studies

In our experiments, the ablation studies on the reaching task and the grasping task lead to similar conclusions. For conciseness, we therefore mainly report results on the reaching task in this section.

IV-B1 Curriculum Learning or Not

In the HGAIL framework, hindsight-transformed data (expert-like data) are converted from trajectories rolled out by generators of varying skill levels over the course of adversarial learning. To some extent, the HGAIL learning paradigm therefore essentially trains the agent's policy with a curriculum learning mechanism.

To determine whether this curriculum learning mechanism is crucial for policy training, we conduct experiments in which policies are trained without it. Concretely, in these ablation experiments, hindsight-transformed data (expert-like data) are produced only from rolled-out trajectories generated by the policy at the beginning of training. Learning curves of success rates and distance errors are shown in Figure 4, and Table II summarizes the performance of the final trained policies. As illustrated, the policy trained with the curriculum learning mechanism shows better performance with respect to both success rate and distance error, and its learning process is more stable.

Fig. 4: Learning curves of policy performance with and without the curriculum learning mechanism vs. iteration steps. Left: success rates. Right: distance errors. Policies trained with the curriculum show better performance, and the training process with the curriculum mechanism is more stable. Best viewed in color.
Method Success rates Distance errors ()
Curriculum learning
No curriculum learning
TABLE II: Performance of policies trained with and without the curriculum learning mechanism

IV-B2 Formation of Hindsight Transformation

Inspired by HER [15], we propose two different strategies for hindsight transformation, called final hindsight transformation and future hindsight transformation. Final hindsight transformation replaces the goal of each state with the position of the final state reached in its own episode. Future hindsight transformation instead randomly replaces the goal of each state with the position of a state observed after it, as shown in Algorithm 1.
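The two strategies differ only in how the relabeling goal is picked; given the sequence of achieved positions in an episode, they can be sketched as:

```python
import random

def relabel_final(achieved):
    """'Final': every step's goal becomes the last achieved position."""
    return [achieved[-1] for _ in achieved]

def relabel_future(achieved, rng=None):
    """'Future': each step's goal is the achieved position of a random
    step at or after it (the variant used in Algorithm 1)."""
    if rng is None:
        rng = random.Random(0)
    return [achieved[rng.randrange(t, len(achieved))]
            for t in range(len(achieved))]
```

For example, `relabel_final([1, 2, 3])` returns `[3, 3, 3]`, while `relabel_future` may assign a different later position to each step.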

Learning curves for the two hindsight transformation strategies are shown in Figure 5, and Table III summarizes the final policy performance. Final hindsight transformation does not work well, and the policy learned with it gradually diverges during training.

Fig. 5: Learning curves on policy performance based on future hindsight transformation (Future) and final hindsight transformation (Final) vs iteration steps. Left: success rates. Right: distance errors. Results show that policy trained with future hindsight transformation exhibits better performance. The training procedure is more stable with future hindsight transformation. Best viewed in color.
Hindsight transformation Success rates Distance errors ()
TABLE III: Performance of policies trained with different hindsight transformation strategies

IV-B3 Hindsight Transformation Probability

So far, all reported policies were trained with a fixed hindsight transformation probability. We are also interested in the effect of the value of $p$ on the performance of the final learned policy, so experiments are carried out with $p$ set to 0.2, 0.4, 0.6, 0.8, and 1. Learning curves are shown in Figure 6. The results illustrate that transforming each state with probability 1 performs best, which differs from HER [15]: the larger the hindsight transformation probability, the better the final learned policy performs.

Fig. 6: Learning curves of policies learned with $p$ set to 0.2, 0.4, 0.6, 0.8, and 1 vs. iteration steps. Left: success rates. Right: distance errors. Policies learned with hindsight transformation probability 1 show the best performance in terms of success rate and distance error; the larger the probability, the better the trained policy performs. Best viewed in color.

IV-B4 Reward Formation

The effect of different reward functions for policy learning in the HGAIL framework is analyzed experimentally. In the GAIL literature, several reward formations derived from the discriminator output have been applied [9][24]. We compare four common reward functions, denoted $r_1$, $r_2$, $r_3$, and $r_4$, each computed from the discriminator output $d_w(s, a)$ for a state-action pair: the first three are built from the sigmoid function $\sigma$ applied to $d_w(s, a)$, while $r_4$ clips $d_w(s, a)$ to a bounded range. The results are illustrated in Figure 7. The policies learned with reward $r_1$ converged fastest compared to the other three reward functions. Rewards $r_1$, $r_2$, and $r_3$ guide the final learned policies to similarly better performance in terms of distance error, in contrast to $r_4$. The policies learned with $r_1$ show the best performance overall, not only in the number of iteration steps needed for training but also in higher success rates and lower distance errors. As a result, we choose $r_1$ as our default reward function for policy learning.
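For illustration, here are four reward shapes commonly derived from a discriminator logit in the GAIL literature; the assignment of these particular formulas to the paper's four rewards is an assumption, not taken from the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Candidate rewards from the discriminator logit d = d_w(s, a).
# These are common choices from the GAIL literature; the exact set
# compared in the paper may differ.
def r1(d):            # -log(1 - D): always positive, a "survival bonus"
    return -np.log(1.0 - sigmoid(d) + 1e-8)

def r2(d):            # log D: always negative, a per-step penalty
    return np.log(sigmoid(d) + 1e-8)

def r3(d):            # log D - log(1 - D), which equals the raw logit d
    return r2(d) + r1(d)

def r4(d, c=5.0):     # the logit clipped to [-c, c]
    return float(np.clip(d, -c, c))
```

The sign and boundedness of the reward matter in sparse-reward episodes: an always-positive reward encourages longer episodes, while an always-negative one encourages finishing quickly.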

Fig. 7: Learning curves of policy performance for four different reward functions vs. iteration steps. Left: success rates. Right: distance errors. Policies learned with the best-performing reward function converge in fewer iteration steps and achieve higher success rates and lower distance errors. Best viewed in color.

IV-C Sim-to-Real Policy Transfer

To validate that a policy trained with our algorithm can be deployed in a real-world physical system without additional training, experiments are conducted on a real-world UR5 robot (the only robot arm available in our lab). The detailed implementation of these experiments is given in Appendix VI-B. As shown in Figure 8, the position of the red ball is the target position for the reaching task, and the pink cube is the target object to be grasped for the grasping task. Frames of the UR5 robot executing the learned policy while reaching the target position and grasping the target object are also shown in Figure 8, and success rates and distance errors are summarized in Table IV. The results show that policies learned with HGAIL transfer successfully from the simulated environment to real-world scenarios, with performance consistent with the simulated environment and no additional training.

Task Success rates Distance errors ()
TABLE IV: Performance of policies deployed in real-world scenarios
Fig. 8: Frames of the learned policy deployed on a real-world UR5 robot for the reaching task (a) and the grasping task (b). The red ball defines the target position for the reaching task, and the pink cube is the target object to be grasped. The robot succeeded in reaching the target position and grasping the target object with high accuracy.

V Conclusion

We propose the HGAIL algorithm, a new learning paradigm under the GAIL framework for learning control policies without expert demonstrations. We adopt a hindsight transformation mechanism to self-synthesize expert-like demonstration data for adversarial policy learning. Experimental results show that the proposed method trains policies efficiently and effectively. In addition, the hindsight transformation technique essentially embodies a curriculum learning mechanism within our framework, which is critical for policy learning. We also validate that a policy trained with our algorithm can be deployed directly on a real-world robot without additional training.

In the future, we plan to apply our method to more continuous and discrete environments. One promising direction is to directly train manipulation skills on a real-world robot, since the amount of interaction data required for training is relatively small. Another exciting direction is to combine the HGAIL algorithm with hierarchical methods to solve more complicated tasks.


VI Appendix

VI-A Implementation Details

In this section, we provide additional details about the experimental tasks setup and hyper-parameters.

VI-A1 Generator

We use a two-layer tanh neural network with 64 units per layer for both the value network and the policy network. The policy network takes as input a concatenated vector of gripper position, gripper velocity, and target position. The policy network's output parameterizes the Gaussian policy distribution: the mean is the output of the policy network, and the covariance is fixed at 1.

VI-A2 Discriminator

We use a two-layer tanh neural network with 100 units in each layer for the discriminator $D_w$.

We set the hyper-parameters $g\_steps$ and $d\_steps$ of Algorithm 2 to fixed values. The learning rate for the discriminator is 0.0004. The batch size is 64 for discriminator optimization and 128 for generator optimization. The number of pre-training steps is 100 for the generator and 500 for the discriminator. For a fair comparison, all experiments were run in a single thread; all of the algorithms (HGAIL, GAIL-demo, GASIL, and HGAIL-no) share the same network architecture and the same hyper-parameters, and PPO shares these parameters with the generator.

It should be mentioned that, unless clearly indicated otherwise, the parameters in the ablation studies are set to default values: the hindsight transformation probability and the reward function take the best-performing settings from the ablation studies above.

VI-B Transfer to Real-World Robot

The final learned policy is transferred directly from the simulated environment to the real-world UR5 robot without additional training. As shown in Figure 8, we use a different object to define the target position. In our working scenario, an RGB-D image is obtained from a depth camera installed above the robot. A separately trained deep neural network (VGG-16) outputs the object pixel position $(u, v)$. The target object position $p$ in the robot coordinate system is then obtained by the following equation:

$$p = R\, d_{uv}\, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} + t,$$

where $K$ is the camera intrinsic matrix, $d_{uv}$ is the depth value at pixel position $(u, v)$, and $R$ and $t$ are the rotation matrix and translation vector from the camera coordinate system to the robot coordinate system, respectively. At each time step, the gripper's position, its velocity, and the target object position $p$ are concatenated into a single vector and fed into the policy network, similar to the training of the Fetch arm in the simulated environment. The mean of the Gaussian policy's output is sent to the robot controller, and the UR5 gripper moves to the next position. This procedure is repeated until the end of the episode.
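Assuming a standard pinhole camera model, this back-projection can be sketched as:

```python
import numpy as np

def pixel_to_robot(u, v, depth, K, R, t):
    """Back-project pixel (u, v) at the given depth into the robot frame,
    assuming a standard pinhole camera model:
        p = R @ (depth * inv(K) @ [u, v, 1]^T) + t
    K: camera intrinsic matrix; R, t: camera-to-robot rotation/translation."""
    uv1 = np.array([u, v, 1.0])
    p_cam = depth * (np.linalg.inv(K) @ uv1)   # 3-D point in the camera frame
    return R @ p_cam + t                       # transform to the robot frame
```

For example, a pixel at the principal point of `K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]` with depth 2.0 maps to `[0, 0, 2]` in the camera frame before the rigid transform is applied.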