I. Introduction
Reinforcement learning (RL) has emerged as a promising method for solving complex decision-making tasks, such as video games [1], robot manipulation [2][3], and autonomous driving [4]. However, devising appropriate reward functions can be quite challenging for many applications [5]. Inverse reinforcement learning (IRL) [6] addresses the problem of learning reward functions from demonstration data and is often considered a branch of imitation learning (IL) [7]. Instead of learning reward functions, other imitation learning methods have been proposed to learn a policy directly from expert demonstrations.
Prior works addressed the IL problem with behavior cloning (BC), which reduces learning a policy from expert demonstrations to supervised learning [8]. However, covariate shift gives rise to compounding errors [9]. To overcome the drawbacks of BC, the generative adversarial imitation learning (GAIL) algorithm [10] was proposed based on the formulation of generative adversarial networks (GAN) [11], where the generator is trained to generate expert-like samples and the discriminator is trained to distinguish between generated and real expert samples. GAIL is an appealing approach: a highly effective and efficient learning framework for policy learning with unknown reward. Inevitably, demonstration data, usually of high quality, have to be provided for the imitation learning paradigm [12]. However, gathering enough high-quality expert demonstrations is usually costly and difficult in many scenarios. To this end, some methods apply imitation learning algorithms that train policies with as few demonstrations as possible, even reducing the requirement to a single demonstration [13].
Taking this a step further and pursuing an alternative learning paradigm, in this paper we consider whether an imitation learning algorithm can be employed successfully with no demonstrations available at all, while the final learned policy shows performance comparable to current imitation learning algorithms, which would be highly promising.
In order to solve the suggested problem, a feasible way is to make the proposed algorithm intelligently self-synthesize expert-like demonstration data. To do so, we propose the hindsight generative adversarial imitation learning (HGAIL) algorithm, which combines the idea of hindsight, inspired by psychology [14] and hindsight experience replay (HER) [15], with GAIL into a unified learning framework. In the process of adversarial training, as illustrated in Figure 1, rolled-out trajectories are generated by the policy G interacting with the environment. Expert-like samples are converted from the rolled-out trajectories via hindsight transformation, while the rolled-out trajectories themselves are directly treated as negative samples without any change; together these satisfy the requirements for training the discriminator and the generator. Our experimental results show that the proposed HGAIL keeps agent training on the rails with no demonstration data provided. The final learned policy shows performance comparable to current imitation learning methods.
Furthermore, our HGAIL algorithm essentially endows the adversarial learning procedure with a curriculum learning mechanism. At different optimization iterations, expert-like demonstration data are synthesized from rolled-out trajectories of different quality levels. Therefore, the rolled-out trajectories and the self-synthesized expert-like data are always appropriately matched for adversarial training, which makes the adversarial policy learning process stable and efficient. As shown in our experiments, the curriculum mechanism is crucial to the performance of the final learned policy.
In summary, our main contribution is a method for imitation learning with no demonstration data available. Expert-like demonstration data are self-synthesized with the hindsight transformation mechanism under the proposed HGAIL learning framework. In addition, our method dynamically transforms the rolled-out trajectories into expert-like data during training, which ensures that the hindsight-transformed data are at the appropriate level for adversarial policy learning. To some extent, this latent learning mechanism automatically forms a curriculum, which greatly benefits the performance of the learned policy. Our proposed HGAIL algorithm is also sample-efficient, as we only need rolled-out trajectories generated by the agent interacting with the environment. No demonstration data or any other external data are required.
II. Related Work
II-A Imitation Learning
Imitation learning algorithms can be classified into three broad categories: behavior cloning (BC), inverse reinforcement learning (IRL), and generative adversarial imitation learning (GAIL).
Behavior cloning reduces imitation learning to supervised learning, which is simple and easy to implement [8]. However, BC needs a huge amount of high-quality expert demonstrations [12].
Inverse reinforcement learning addresses the imitation learning problem by inferring a reward function from demonstration data and then using the learned reward function to train a policy. Prior works in IRL include maximum-margin [16][17] and maximum-entropy [18][19][20] formulations.
Generative adversarial imitation learning (GAIL) [10] is a recent imitation learning method inspired by generative adversarial networks (GAN) [11]. A similar framework called guided cost learning (GCL) has also been proposed [21] for inverse reinforcement learning. As training GAIL is notoriously unstable, many works focus on improving stability and robustness by learning semantic policy embeddings [22], via kernel mean embedding [23], or by enforcing an information bottleneck to constrain information flow in the discriminator [24]. More recent works extend the framework by learning robust rewards from states only [25] or from state-action pairs [26] in a transfer setting for new policy learning.
II-B Hindsight Experience Replay
Hindsight experience replay (HER) [15] was proposed to deal with sparse rewards in reinforcement learning. The key insight of HER is that even failed rollouts, in which no valuable reward was obtained, can be transformed into successful ones by assuming that a state the agent saw during the rollout was the actual goal. Recent works have improved the performance of HER by rewarding hindsight experiences more [29], combining curiosity and prioritization mechanisms [30], or computing trajectory energy based on the work-energy principle in physics [31]. An extension of HER called dynamic hindsight experience replay (DHER) [32] was proposed to deal with dynamic goals.
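The HER relabelling idea described above can be sketched as follows. This is an illustrative simplification, not the reference implementation: the trajectory layout, field order, and `reward_fn` signature are our assumptions.

```python
import random

def her_relabel(trajectory, reward_fn):
    """Hindsight relabelling sketch: pretend a state achieved later in the
    rollout was the actual goal, so a failed episode yields useful reward.
    `trajectory` is a list of (state, action, goal) tuples;
    `reward_fn(state, goal)` returns the sparse reward."""
    relabelled = []
    for t, (state, action, goal) in enumerate(trajectory):
        # sample a state achieved at or after step t as the substitute goal
        future = random.choice(trajectory[t:])[0]
        relabelled.append((state, action, future, reward_fn(state, future)))
    return relabelled
```

With a sparse 0/-1 reward, the final transition of any relabelled episode necessarily receives the success reward, which is exactly what makes the failed rollout informative.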
II-C Learning with Few Data
Generally, training policies with imitation learning requires expert demonstration data, often in large quantity or of high quality. In some training scenarios, obtaining expert demonstrations is not easy [12]. Many works have emerged that aim to make imitation learning algorithms work well with fewer demonstrations [33], such as meta-learning frameworks [34][35][36], neural task programming [37], and combining reinforcement with imitation learning [39]. Zero-shot learning has been proposed to address visual demonstrations without actions [40].
In recent works, self-imitation learning methods [41][42] were proposed to train policies to reproduce the agent's past good experience without external demonstrations. [43] proposes the generative adversarial self-imitation learning (GASIL) method, which encourages the agent to imitate past good trajectories via the generative adversarial imitation learning framework.
Instead of choosing the top-K trajectories according to episode return as the positive samples, as GASIL does [43], we employ the hindsight idea to directly transform the generator data into expert-like demonstration data. Experiments show that our proposed method outperforms GASIL in the robot reaching and grasping scenarios (Section IV-A).
III. Method
To outline our method, we first consider a standard GAIL learning framework consisting of a policy (generator) π_θ and a discriminator D_w, parameterized by θ and w respectively. The goal of the policy is to generate rolled-out trajectories similar to the demonstration trajectories, while the discriminator aims to distinguish between state-action pairs sampled from the expert demonstration trajectories and those from the generator trajectories. The generator and the discriminator are optimized with the following objective function
min_θ max_w  E_{π_E}[log D_w(s,a)] + E_{π_θ}[log(1 - D_w(s,a))] - λ H(π_θ)    (1)
where π_E is the expert policy, H(π_θ) is the causal entropy of the policy, which encourages sufficient exploration of the action space, and λ is the regularization weight.
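For illustration, a Monte-Carlo estimate of this minimax objective can be computed from batches of discriminator outputs. The function below is a sketch with names of our choosing; `d_expert` and `d_policy` stand for discriminator scores in (0, 1) on expert-like and rolled-out state-action pairs.

```python
import numpy as np

def gail_objective(d_expert, d_policy, entropy, lam=1e-3):
    """Sample estimate of E_expert[log D] + E_policy[log(1 - D)] - lam * H(pi).
    d_expert / d_policy are arrays of discriminator outputs in (0, 1);
    `entropy` is an estimate of the policy's causal entropy."""
    return (np.mean(np.log(d_expert))
            + np.mean(np.log(1.0 - d_policy))
            - lam * entropy)
```

The discriminator ascends this quantity while the policy descends it, which is the saddle-point structure that the rest of the method alternates over.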
In the concrete implementation of the proposed HGAIL approach, the policy and the discriminator are represented by multi-layer neural networks. The output of the policy network parameterizes the Gaussian policy π_θ(a|s,g) = N(μ, Σ), where μ is the mean and Σ is the covariance. At the beginning of each episode, our agent samples a goal g and an initial state s_0. At time step t, the agent takes an action a_t ~ π_θ(a | s_t ⊕ g) sampled from the Gaussian policy based on the current policy, the state s_t and the goal g, where ⊕ denotes concatenation. The agent then moves to the next state s_{t+1} according to the transition dynamics, and receives a reward given by the discriminator. At the end of each episode, a trajectory sequence τ = (s_0, a_0, …, s_T, a_T) is generated, where T is the length of the trajectory. Repeating the above procedure N times, we obtain rolled-out trajectories 𝒯 = {τ_1, …, τ_N}. In order to train the discriminator without expert demonstrations, our method leverages the hindsight transformation technique (as shown in Algorithm 1) to convert the rolled-out trajectories 𝒯 into expert-like trajectories 𝒯_h.
More specifically, the detailed steps for self-synthesizing expert-like trajectories from rolled-out trajectories are as follows. First, for each trajectory τ in 𝒯, we choose each time step t with probability p (0 ≤ p ≤ 1) for hindsight transformation; all chosen time steps of τ are collected into a set C. Second, for every time step t in C, we replace the goal of state s_t with the position achieved at state s_j, where j is randomly chosen from the time steps t to T of τ. In other words, we randomly set the new goal of state s_t to a position achieved at or after observing s_t. In this way, the rolled-out trajectory τ is transformed into an expert-like trajectory τ_h. The above procedure is repeated until all trajectories are transformed.
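The two steps above can be sketched in a few lines of Python. The per-step dict layout (`'state'`, `'action'`, `'goal'`, `'achieved'` for the position reached at that step) is our assumption, not taken from the paper's implementation.

```python
import random

def hindsight_transform(trajectory, p=1.0):
    """Sketch of the "future" hindsight transformation: with probability p,
    relabel each step's goal with the position achieved at a randomly chosen
    later step of the same trajectory."""
    T = len(trajectory)
    out = []
    for t, step in enumerate(trajectory):
        new = dict(step)  # shallow copy; the rolled-out trajectory is kept unchanged
        if random.random() < p:
            j = random.randint(t, T - 1)             # a step from t..T-1
            new['goal'] = trajectory[j]['achieved']  # future achieved position
        out.append(new)
    return out
```

Because the relabelled goal was in fact achieved, every transformed trajectory looks like a successful, expert-like demonstration, while the untouched original serves as the negative sample.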
The policy (generator) is optimized with the policy gradient method proximal policy optimization (PPO) [44]. The objective function is
L(θ) = E_{π_θ}[log(1 - D_w(s,a))] - λ H(π_θ)    (2)
The gradient is given by
∇_θ L(θ) = E_{π_θ}[∇_θ log π_θ(a|s) Q(s,a)] - λ ∇_θ H(π_θ),  Q(s̄, ā) = E_{π_θ}[r(s,a) | s_0 = s̄, a_0 = ā]    (3)
where Q(s̄, ā) is the action-value function and r(s,a) is the reward function derived from the output of the discriminator.
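The reward r(s,a) fed to the policy is a function of the discriminator output. The sketch below lists reward shapes commonly used in GAIL-style methods (the set of formations the paper itself compares is discussed in Section IV-B4); the function name and the exact set shown here are illustrative assumptions.

```python
import numpy as np

def discriminator_rewards(d, eps=1e-8):
    """Common reward shapes derived from a discriminator output d = D_w(s,a)
    in (0, 1), where D is assumed to score expert-like pairs near 1.
    `eps` guards the logarithms against 0."""
    return {
        'log_d':       np.log(d + eps),               # log D(s,a)
        'neg_log_1md': -np.log(1.0 - d + eps),        # -log(1 - D(s,a))
        'logit':       np.log(d + eps) - np.log(1.0 - d + eps),
        'raw':         d,                             # D(s,a) itself
    }
```

The shapes differ in sign and boundedness (e.g. log D is always negative, -log(1 - D) always positive), which is known to affect optimization dynamics in adversarial imitation learning.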
The discriminator is optimized via minimizing the cross entropy, i.e., by maximizing the following function
E_{(s,a)∼𝒯_h}[log D_w(s,a)] + E_{(s,a)∼𝒯}[log(1 - D_w(s,a))]    (4)
The gradient is given by
E_{(s,a)∼𝒯_h}[∇_w log D_w(s,a)] + E_{(s,a)∼𝒯}[∇_w log(1 - D_w(s,a))]    (5)
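As a worked toy instance of the discriminator update, the cross-entropy loss and its gradient can be computed in closed form for a linear-logit discriminator D_w(x) = sigmoid(w·x). This is a stand-in for the neural discriminator; the linear parameterization is purely for illustration.

```python
import numpy as np

def disc_loss_and_grad(w, x_pos, x_neg):
    """Binary cross-entropy and its gradient for D_w(x) = sigmoid(w . x).
    x_pos: hindsight-transformed (expert-like) pairs, shape (N, dim).
    x_neg: rolled-out pairs, shape (M, dim).
    Gradient descent on this loss is equivalent to ascending Eq. (4)."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    d_pos, d_neg = sig(x_pos @ w), sig(x_neg @ w)
    loss = -np.mean(np.log(d_pos)) - np.mean(np.log(1.0 - d_neg))
    # d/dw of -log sigmoid(w.x) is -(1 - sigmoid) x; of -log(1 - sigmoid) is sigmoid x
    grad = (-np.mean((1.0 - d_pos)[:, None] * x_pos, axis=0)
            + np.mean(d_neg[:, None] * x_neg, axis=0))
    return loss, grad
```

At w = 0 the discriminator outputs 0.5 everywhere, giving the maximum-entropy loss 2 log 2; when positive and negative batches coincide the gradient cancels, reflecting that an indistinguishable generator is the adversarial fixed point.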
The fully detailed HGAIL algorithm is shown in Algorithm 2. At the beginning of training, we generate trajectories 𝒯 using the policy with random weights. Expert-like demonstration data 𝒯_h are synthesized from 𝒯 with the hindsight transformation technique, as shown in Algorithm 1. 𝒯 is regarded as the negative samples and 𝒯_h as the positive samples. We use maximum likelihood estimation (MLE) to pretrain the policy on 𝒯_h, and we pretrain the discriminator via minimizing the cross entropy between 𝒯 and 𝒯_h. We found this pretraining procedure beneficial for policy training. After pretraining, optimization of the policy and the discriminator proceeds by alternating between policy gradient steps that decrease (2) with respect to the policy parameters θ and gradient steps that increase (4) with respect to the discriminator parameters w, until the policy and the discriminator both converge.
IV. Experiments and Results
In this section, our goal is to test whether policies learned via the proposed HGAIL method work well without external demonstrations. In addition, ablation studies are conducted to show the influence of different mechanisms and hyperparameters on policy learning. Finally, experiments are carried out to test whether the final learned policies can be directly transferred to a real-world physical system.
IV-A Policy Learning
To test the feasibility of the proposed HGAIL method, experiments are carried out on two common robot tasks in the gym environment [46]: reaching a target position and grasping a target object [45] (as shown in Figure 2). To make these two tasks more challenging, we assume that only a binary sparse reward is available. For the reaching task, the reward is -1 for most states and 0 only when the robot gripper reaches the target position. Similarly, for the grasping task, the reward is -1 for most states and 0 only when the robot gripper succeeds in grasping the target object.
We compare our proposed HGAIL algorithm against the following methods: (1) GAIL [10] with demonstrations available, denoted GAIL-demo; (2) PPO [44], a state-of-the-art policy gradient method; (3) GASIL [43]; (4) HGAIL without the hindsight transformation technique, denoted HGAIL-no.
Figure 2. Two tasks implemented on the Fetch robot in gym. The Fetch robot has seven degrees of freedom. In our experiments, the robot takes a four-dimensional action vector as input: the first three elements move the end-effector (gripper) along three orthogonal directions, and the fourth controls whether the gripper is closed or open. Left: reaching task; the red point denotes the target position within the robot workspace, and the fourth element of the action vector is fixed. Right: grasping task; the black cube is the target object to be picked. Best viewed in color.
The performance of the learned policies is measured in terms of two metrics: distance error and success rate. Distance error is measured by the distance between the target position and the gripper position at the end of each episode:
d = || p_target - p_gripper ||_2    (6)
Success rate specifies the ratio of episodes in which the gripper reaches the target position within an allowed error ε on the distance in equation (6) (for the reaching task), or grasps the desired object (for the grasping task), to the total number of episodes:
success rate = (1/N) Σ_{i=1}^{N} 1[d_i < ε]    (7)
where 1[·] is the indicator function, which outputs 1 when its argument is true and 0 otherwise.
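The two metrics can be computed directly from the final gripper and target positions of a batch of episodes; the sketch below assumes 3-D position arrays and an illustrative tolerance value.

```python
import numpy as np

def metrics(final_gripper_pos, target_pos, eps=0.01):
    """Distance error (Eq. (6)) and success rate (Eq. (7)) over N episodes.
    Both arguments are (N, 3) arrays of end-of-episode positions; eps is the
    allowed error (value assumed here, not taken from the paper)."""
    dist = np.linalg.norm(final_gripper_pos - target_pos, axis=1)
    success_rate = np.mean(dist < eps)  # indicator averaged over episodes
    return dist.mean(), success_rate
```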
Implementation details are available in Appendix VI-A. Learning curves for policies learned with HGAIL compared to the above-mentioned methods on the reaching and grasping tasks are shown in Figure 3, and Table 1 summarizes the performance of the final learned policies.
Compared to GAIL with demonstrations available, our HGAIL algorithm shows comparable performance in terms of success rate and final distance error on both the reaching task and the grasping task, although the policies trained with our method converge more slowly on the grasping task. Considering that no demonstration data are used, the performance of policies trained with HGAIL is promising. Compared to PPO, our algorithm performs much better: although the policy component of our algorithm is optimized via PPO, outside our HGAIL framework PPO alone cannot train successful policies for tasks with binary sparse reward. Policies trained with GASIL show slower optimization and poorer final performance on these two tasks. In comparison with HGAIL-no, HGAIL exhibits better performance, which indicates that hindsight transformation is a crucial ingredient of the proposed algorithm when demonstrations are unavailable. These results show that HGAIL can work well without demonstration data and learn successful policies. We can also see that, as our algorithm essentially endows training with a curriculum learning mechanism, HGAIL optimizes faster than GAIL with demonstrations at the beginning of policy training.
Method | Reaching task (success rate / distance error) | Grasping object task (success rate / distance error)
GAIL-demo | |
GASIL | |
PPO | |
HGAIL (ours) | |
HGAIL-no | |
IV-B Ablation Studies
In our experiments, ablation studies on the reaching task and the grasping task show similar conclusions. Consequently, to keep the paper concise, we mainly report the results on the reaching task in this section.
IV-B1 Curriculum Learning or Not
In the HGAIL framework, hindsight-transformed data (expert-like data) are converted from rollouts produced by generators of various quality levels during adversarial learning. To some extent, our HGAIL learning paradigm essentially endows policy training with a curriculum learning mechanism.
To show whether this curriculum learning mechanism is crucial for policy training, we conducted experiments in which policies were trained without it. Concretely, in these ablation experiments, hindsight-transformed data (expert-like data) were converted only from rolled-out trajectories produced by the policy at the beginning of the training period. Learning curves for success rate and distance error are shown in Figure 4, and Table 2 summarizes the performance of the final trained policies. As illustrated, the policy trained with the curriculum learning mechanism performs better with respect to both success rate and distance error, and its learning process is more stable.
Method | Success rate | Distance error
Curriculum learning | |
No curriculum learning | |
IV-B2 Formation of Hindsight Transformation
Inspired by HER [15], we propose two different strategies for hindsight transformation, called final hindsight transformation and future hindsight transformation respectively. Final hindsight transformation replaces the goal of each state with the position of the final state reached in its own episode. Future hindsight transformation instead randomly replaces the target position of each state with the position of a state observed after it, as shown in Algorithm 1.
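The difference between the two strategies can be sketched as two relabelling functions over the list of achieved positions of one episode; the list-based interface here is an assumption for illustration.

```python
import random

def relabel_final(achieved, goals):
    """'Final' strategy: every state's goal becomes the position achieved at
    the last step of its own episode."""
    return [achieved[-1]] * len(goals)

def relabel_future(achieved, goals):
    """'Future' strategy: each state's goal becomes the position achieved at a
    randomly chosen later (or current) step, as in Algorithm 1."""
    T = len(goals)
    return [achieved[random.randint(t, T - 1)] for t in range(T)]
```

The final strategy produces a single shared goal per episode, whereas the future strategy yields diverse, per-step goals; that diversity is one plausible reason for the difference in stability reported below.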
Learning curves for the two hindsight transformation strategies are shown in Figure 5, and Table 3 summarizes the final policies' performance. Final hindsight transformation does not work well: the policy learned with it gradually diverges during training.
Hindsight transformation | Success rate | Distance error
Future | |
Final | |
IV-B3 Hindsight Transformation Probability
So far, all the learned policies were trained with hindsight transformation probability p = 1. We are also interested in the effect of the value of p on the performance of the final learned policy. Experiments were carried out with p set to 0.2, 0.4, 0.6, 0.8 and 1. Learning curves are shown in Figure 6. The results illustrate that transforming each state with probability 1 performs best, which differs from HER [15]: the larger the hindsight transformation probability, the better the final learned policy performs.
IV-B4 Reward Formation
Different reward functions for policy learning in the HGAIL framework are experimentally analyzed. In the GAIL literature, different reward formations have been applied [9][24]. We compare four common reward functions derived from the discriminator output D_w(s,a), where σ denotes the sigmoid function, D_w(s,a) is the output of the discriminator taking state-action pairs as input, and the output is clipped to a bounded range. The results are illustrated in Figure 7. One of the reward formations converges fastest compared with the other three, while three of the formations guide the final learned policies to similarly better distance errors than the remaining one. The best-performing reward function stands out not only in the number of iteration steps consumed for policy training, but also in higher success rates and lower distance errors. As a result, we choose it as our default reward function for policy learning.
IV-C Sim-to-Real Policy Transfer
To validate the feasibility of deploying the policy trained with our algorithm in a real-world physical system (with no additional training), experiments were conducted on a real-world UR5 robot (the only robot arm available in our lab). The detailed implementation of the experiments is given in Appendix VI-B. As shown in Figure 8, the position of the red ball is the target position for the reaching task, and the pink cube is the target object to be grasped for the grasping task. Frames of the UR5 robot executing the learned policy while reaching the target position and grasping the target object are pictured in Figure 8. Success rates and distance errors are summarized in Table 4. The results show that the policy learned with HGAIL can successfully transfer from the simulated environment to real-world scenarios, and its performance in the real world is consistent with the simulated environment without additional training.
Task | Success rate | Distance error
Reaching | |
Picking | |
V. Conclusion
We propose the HGAIL algorithm, a new learning paradigm under the GAIL framework for learning control policies without expert demonstrations. We adopt a hindsight transformation mechanism to self-synthesize expert-like demonstration data for adversarial policy learning. Experimental results show that the proposed method trains policies efficiently and effectively. In addition, the curriculum learning mechanism that the hindsight transformation technique essentially endows to our framework is critical for policy learning. We also validate that a policy trained with our algorithm can be directly deployed on a real-world robot without additional training.
In the future, we want to apply our method in more continuous and discrete environments. A promising direction is directly applying our method to training manipulation skills on a real-world robot, as the amount of interaction data required for training is relatively small. Another exciting direction is to combine the HGAIL algorithm with hierarchy to solve more complicated tasks.
References
[1] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning, Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[2] Andrychowicz M, Baker B, Chociej M, et al. Learning dexterous in-hand manipulation, arXiv preprint arXiv:1808.00177, 2018.
[3] Levine S, Finn C, Darrell T, et al. End-to-end training of deep visuomotor policies, The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334-1373, 2016.
[4] Kuderer M, Gulati S, Burgard W. Learning driving styles for autonomous vehicles from demonstration, IEEE International Conference on Robotics and Automation. 2015: 2641-2646.
[5] Qureshi A H, Yip M C. Adversarial imitation via variational inverse reinforcement learning, International Conference on Learning Representations, 2019.
[6] Ng A, Russell S. Algorithms for inverse reinforcement learning, International Conference on Machine Learning. 2000: 663-670.
[7] Argall B D, Chernova S, Veloso M, et al. A survey of robot learning from demonstration, Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469-483, 2009.
[8] Nejati N, Langley P, Konik T. Learning hierarchical task networks by observation, International Conference on Machine Learning. 2006: 665-672.

[9] Ross S, Gordon G, Bagnell D. A reduction of imitation learning and structured prediction to no-regret online learning, In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011: 627-635.
[10] Ho J, Ermon S. Generative adversarial imitation learning, In Advances in Neural Information Processing Systems. 2016: 4565-4573.
[11] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets, In Advances in Neural Information Processing Systems. 2014: 2672-2680.
[12] Zhang T, McCarthy Z, Jowl O, et al. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation, IEEE International Conference on Robotics and Automation. 2018: 5628-5635.
[13] Duan Y, Andrychowicz M, Stadie B, et al. One-shot imitation learning, In Advances in Neural Information Processing Systems. 2017: 1087-1098.
[14] Fischhoff B. Hindsight is not equal to foresight: The effect of outcome knowledge on judgment under uncertainty, Journal of Experimental Psychology: Human Perception and Performance, vol. 1, no. 3, pp. 288, 1975.
[15] Andrychowicz M, Wolski F, Ray A, et al. Hindsight experience replay, In Advances in Neural Information Processing Systems. 2017: 5048-5058.
[16] Abbeel P, Ng A. Apprenticeship learning via inverse reinforcement learning, International Conference on Machine Learning. 2004: 1.
[17] Ratliff N D, Bagnell J A, Zinkevich M A. Maximum margin planning, International Conference on Machine Learning. 2006: 729-736.
[18] Ziebart B D, Maas A L, Bagnell J A, et al. Maximum entropy inverse reinforcement learning, In Twenty-Third AAAI Conference on Artificial Intelligence. 2008: 1433-1438.
[19] Bloem M, Bambos N. Infinite time horizon maximum causal entropy inverse reinforcement learning, IEEE Conference on Decision and Control. 2014: 4911-4916.
[20] Boularias A, Kober J, Peters J. Relative entropy inverse reinforcement learning, In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011: 182-189.
[21] Finn C, Levine S, Abbeel P. Guided cost learning: Deep inverse optimal control via policy optimization, International Conference on Machine Learning. 2016: 49-58.
[22] Wang Z, Merel J S, Reed S E, et al. Robust imitation of diverse behaviors, In Advances in Neural Information Processing Systems. 2017: 5320-5329.
[23] Kim K E, Park H S. Imitation learning via kernel mean embedding, In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[24] Peng X B, Kanazawa A, Toyer S, et al. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow, arXiv preprint arXiv:1810.00821, 2018.
[25] Fu J, Luo K, Levine S. Learning robust rewards with adversarial inverse reinforcement learning, International Conference on Learning Representations, 2017.
[26] Qureshi A H, Boots B, Yip M C. Adversarial imitation via variational inverse reinforcement learning, International Conference on Learning Representations, 2019.
[27] Torabi F, Warnell G, Stone P. Generative adversarial imitation from observation, arXiv preprint arXiv:1807.06158, 2018.
[28] Li Y, Song J, Ermon S. InfoGAIL: Interpretable imitation learning from visual demonstrations, In Advances in Neural Information Processing Systems. 2017: 3812-3822.
[29] Lanka S, Wu T. ARCHER: Aggressive rewards to counter bias in hindsight experience replay, arXiv preprint arXiv:1809.02070, 2018.
[30] Zhao R, Tresp V. Curiosity-driven experience prioritization via density estimation, arXiv preprint arXiv:1902.08039, 2019.
[31] Zhao R, Tresp V. Energy-based hindsight experience prioritization, In 2nd Conference on Robot Learning, 2018.
[32] Fang M, Zhou C, Shi B, et al. DHER: Hindsight experience replay for dynamic goals, International Conference on Learning Representations, 2019.
[33] Finn C, Yu T, Zhang T, et al. One-shot visual imitation learning via meta-learning, In 1st Conference on Robot Learning. 2017: 357-368.
[34] Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks, International Conference on Machine Learning. 2017: 1126-1135.
[35] Yu T, Finn C, Xie A, et al. One-shot imitation from observing humans via domain-adaptive meta-learning, arXiv preprint arXiv:1802.01557, 2018.
[36] Finn C, Yu T, Zhang T, et al. One-shot visual imitation learning via meta-learning, In Conference on Robot Learning. 2017: 357-368.
[37] Xu D, Nair S, Zhu Y, et al. Neural task programming: Learning to generalize across hierarchical tasks, IEEE International Conference on Robotics and Automation. 2018: 3795-3802.
[38] Wang Z, Merel J S, Reed S E, et al. Robust imitation of diverse behaviors, In Advances in Neural Information Processing Systems. 2017: 5320-5329.
[39] Zhu Y, Wang Z, Merel J S, et al. Reinforcement and imitation learning for diverse visuomotor skills, arXiv preprint arXiv:1802.09564, 2018.

[40] Pathak D, Mahmoudieh P, Luo G, et al. Zero-shot visual imitation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018: 2050-2053.
[41] Oh J, Guo Y, Singh S, et al. Self-imitation learning, In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, 2018.
[42] Gangwani T, Liu Q, Peng J. Learning self-imitating diverse policies, International Conference on Learning Representations, 2019.
[43] Guo Y, Oh J, Singh S, et al. Generative adversarial self-imitation learning, arXiv preprint arXiv:1812.00950, 2018.
[44] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347, 2017.
[45] Plappert M, Andrychowicz M, Ray A, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research, arXiv preprint arXiv:1802.09464, 2018.
[46] Brockman G, Cheung V, Pettersson L, et al. OpenAI Gym, arXiv preprint arXiv:1606.01540, 2016.
[47] Schulman J, Levine S, Abbeel P, et al. Trust region policy optimization, International Conference on Machine Learning. 2015: 1889-1897.
[48] Hester T, Vecerik M, Pietquin O, et al. Deep Q-learning from demonstrations, In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[49] Che T, Li Y, Jacob A P, et al. Mode regularized generative adversarial networks, International Conference on Learning Representations, 2017.
VI. Appendix
VI-A Implementation Details
In this section, we provide additional details about the experimental tasks setup and hyperparameters.
VI-A1 Generator
We use a two-layer tanh neural network with 64 units per layer for both the value network and the policy network. The policy network takes as input a concatenated vector of gripper position, gripper velocity, and target position. The policy network's output parameterizes the Gaussian policy distribution, where the mean is the output of the policy network and the fixed covariance is set to 1.
VI-A2 Discriminator
We use a two-layer tanh neural network with 100 units in each layer for the discriminator.
The learning rate for the discriminator is 0.0004. The batch size is 64 for discriminator optimization and 128 for generator optimization. The number of pretraining steps is 100 for the generator and 500 for the discriminator. For a fair comparison, all experiments were run in a single thread; all algorithms (HGAIL, GAIL-demo, GASIL, and HGAIL-no) share the same network architecture and hyperparameters, and PPO shares these parameters with the generator.
It should be mentioned that, unless clearly indicated otherwise, all parameters in the ablation studies are set to the following default values: hindsight transformation probability p = 1, with the default reward function of Section IV-B4.
VI-B Transfer to Real-World Robot
The final learned policy is directly transferred from the simulated environment to a real-world UR5 robot without additional training. As shown in Figure 8, we use a distinct object to define the target position. In our working scenario, an RGB-D image is obtained from a depth camera installed above the robot. A separately trained deep neural network (VGG16) outputs the object's pixel position (u, v). The target object position under the robot coordinate system is obtained by the following equation
p = R (d K^{-1} [u, v, 1]^T) + t    (8)
where K is the camera intrinsic matrix, d is the depth value at pixel position (u, v), and R and t are the rotation matrix and translation vector from the camera coordinate system to the robot coordinate system, respectively. At each time step, the gripper's position and velocity and the target object position p are concatenated into a single vector fed into the policy network, similar to the training of the Fetch arm in the simulated environment. The mean of the Gaussian policy output is sent to the robot controller, and the UR5 gripper moves to the next position. The above procedure is repeated until the end of the episode.
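The back-projection in equation (8) is the standard pinhole-camera computation and can be sketched as follows; the function name and argument order are our own.

```python
import numpy as np

def pixel_to_robot(u, v, depth, K, R, t):
    """Back-project pixel (u, v) with depth d into camera coordinates via the
    intrinsic matrix K, then map into the robot frame with rotation R and
    translation t, as in Eq. (8)."""
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    return R @ p_cam + t
```

With real calibration data, K comes from camera calibration and (R, t) from hand-eye calibration between the camera and the robot base.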