Reinforcement learning (RL) has emerged as a promising method for solving complex decision-making tasks, such as video games, robot manipulation, and autonomous driving. However, devising appropriate reward functions can be quite challenging for many applications. Inverse reinforcement learning (IRL) addresses the problem of learning reward functions from demonstration data and is often considered a branch of imitation learning (IL). Instead of learning reward functions, other imitation learning methods have been proposed to learn a policy directly from expert demonstrations.
Prior works addressed the IL problem with behavior cloning (BC), which reduces learning a policy from expert demonstrations to supervised learning. However, covariate shift gives rise to compounding errors. To overcome the drawbacks of BC, the generative adversarial imitation learning (GAIL) algorithm was proposed, based on the formulation of generative adversarial networks (GANs), where the generator is trained to generate expert-like samples and the discriminator is trained to distinguish between generated and real expert samples. GAIL is an appealing approach: a highly effective and efficient framework for policy learning when the reward is unknown.
Inevitably, demonstration data, usually of high quality, must be provided to the imitation learning paradigm. However, gathering enough high-quality expert demonstrations is costly and difficult in many scenarios. To this end, methods have been proposed that train policies with as few demonstrations as possible, sometimes reducing the requirement to a single demonstration.
Taking a step further and pursuing an alternative learning paradigm, in this paper we consider whether an imitation learning algorithm can be employed successfully without any demonstrations available, such that the final learned policy shows performance comparable to current imitation learning algorithms, which would be highly promising.
To solve this problem, a feasible approach is to make the algorithm intelligently self-synthesize expert-like demonstration data. To this end, we propose the hindsight generative adversarial imitation learning (HGAIL) algorithm, which combines the idea of hindsight, inspired by psychology and by hindsight experience replay (HER), with GAIL in a unified learning framework. In the process of adversarial training, as illustrated in Figure 1, rolled-out trajectories are generated by the policy (generator) interacting with the environment. Expert-like samples are converted from the rolled-out trajectories by a hindsight transformation, while the rolled-out trajectories themselves are treated as negative samples without any change; together these satisfy the data requirements for training the discriminator and the generator. Our experimental results show that HGAIL keeps agent training on the rails with no demonstration data provided, and the final learned policy shows performance comparable to current imitation learning methods.
Furthermore, our HGAIL algorithm essentially endows the adversarial learning procedure with a curriculum learning mechanism. At different optimization steps, expert-like demonstration data are synthesized from rolled-out trajectories of different quality levels. The rolled-out trajectories and the self-synthesized expert-like data are therefore always well matched for adversarial training, which makes adversarial policy learning stable and efficient. As shown in our experiments, this curriculum mechanism is crucial to the performance of the final learned policy.
In summary, our main contribution is a method for achieving imitation learning with no demonstration data available. Expert-like demonstration data are self-synthesized with the hindsight transformation mechanism under the proposed HGAIL learning framework. In addition, our method dynamically transforms the rolled-out trajectories into expert-like data during training, which ensures the hindsight-transformed data are always at an appropriate level for adversarial policy learning. To some extent, this latent learning mechanism automatically forms a curriculum, which greatly benefits the performance of the learned policy. Our proposed HGAIL algorithm is also sample-efficient, as it only needs rolled-out trajectories generated by the agent interacting with the environment: no demonstration data or any other external data are required.
II Related Work
II-A Imitation Learning
Imitation learning algorithms can be classified into three broad categories: behavior cloning (BC), inverse reinforcement learning (IRL), and generative adversarial imitation learning (GAIL).
Inverse reinforcement learning addresses the imitation learning problem by inferring a reward function from demonstration data and then using the learned reward function to train a policy. Prior works in IRL include maximum-margin and maximum-entropy formulations.
Generative adversarial imitation learning (GAIL) is a recent imitation learning method inspired by generative adversarial networks (GANs). A similar framework called guided cost learning (GCL) has also been proposed for inverse reinforcement learning. As training GAIL is notoriously unstable, many works focus on improving stability and robustness by learning semantic policy embeddings, via kernel mean embeddings, or by enforcing an information bottleneck to constrain information flow in the discriminator. More recent works extend the learning framework by learning robust rewards from states only, or from state-action pairs, in transfer settings for new policy learning.
II-B Hindsight Experience Replay
Hindsight experience replay (HER) was proposed for dealing with sparse rewards in reinforcement learning. The key insight of HER is that even failed rollouts, in which no valuable reward was obtained, can be transformed into successful ones by assuming that a state observed in the rollout was the actual goal. Recent works have improved HER by rewarding hindsight experiences more, combining curiosity and prioritization mechanisms, or prioritizing trajectories by their energy, based on the work-energy principle in physics. An extension of HER called dynamic hindsight experience replay (DHER) was proposed to deal with dynamic goals.
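The goal-relabeling step at the heart of HER can be sketched as follows. The `Transition` tuple, the binary -1/0 reward, and the `k` future samples are illustrative assumptions, not the original implementation:

```python
import random
from collections import namedtuple

# A transition stores the achieved state so it can later serve as a relabeled goal.
Transition = namedtuple("Transition", ["state", "action", "goal", "achieved", "reward"])

def her_relabel(episode, k=4):
    """Relabel transitions with goals achieved later in the same episode.

    For each transition, sample up to `k` future time steps and pretend the
    state achieved there was the intended goal, so the sparse reward can
    become 0 (success) instead of -1 (failure).
    """
    relabeled = []
    for t, tr in enumerate(episode):
        future_steps = random.sample(range(t, len(episode)),
                                     min(k, len(episode) - t))
        for f in future_steps:
            new_goal = episode[f].achieved
            # Binary sparse reward: 0 if the achieved state matches the new goal.
            new_reward = 0.0 if tr.achieved == new_goal else -1.0
            relabeled.append(tr._replace(goal=new_goal, reward=new_reward))
    return relabeled
```

Relabeling a transition with a goal it actually achieved turns a failed experience into a rewarded one, which is exactly the property HGAIL exploits to manufacture positive samples.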
II-C Learning with Few Data
Generally, training policies with imitation learning requires expert demonstration data, often in large quantities or of high quality. In some training scenarios, obtaining expert demonstrations is not easy. A lot of work has emerged to make imitation learning algorithms work well with fewer demonstrations, such as meta-learning frameworks, neural task programming, and combinations of reinforcement and imitation learning. Zero-shot learning has been proposed for addressing visual demonstrations without actions.
In recent works, self-imitation learning methods were proposed to train policies to reproduce the agent's past good experience without external demonstrations. Generative Adversarial Self-imitation Learning (GASIL) encourages the agent to imitate its past good trajectories via the generative adversarial imitation learning framework.
Instead of choosing the top-K trajectories by episode return as positive samples, as GASIL does, we employ the hindsight idea to directly transform the generator's data into expert-like demonstration data. Experiments show that our proposed method outperforms GASIL in robot reaching and grasping scenarios (Section IV-A).
To outline our method, we first consider the standard GAIL learning framework, consisting of a policy (generator) $\pi_\theta$ and a discriminator $D_w$, parameterized by $\theta$ and $w$ respectively. The goal of the policy is to generate rolled-out trajectories similar to the demonstration trajectories, and the discriminator distinguishes between state-action pairs sampled from the expert demonstration trajectories and from the generator's trajectories. The generator and the discriminator are optimized with the following objective function:

$$\min_{\theta}\max_{w}\; \mathbb{E}_{\pi_E}\!\left[\log D_w(s,a)\right] + \mathbb{E}_{\pi_\theta}\!\left[\log\left(1 - D_w(s,a)\right)\right] - \lambda H(\pi_\theta), \quad (1)$$

where $\pi_E$ is the expert policy, $H(\pi_\theta)$ is the causal entropy of the policy, which encourages the policy to sufficiently explore the action space, and $\lambda$ is the regularization weight.
In the concrete implementation of the proposed HGAIL approach, the policy and the discriminator are represented by multi-layer neural networks. The output of the policy network parameterizes a Gaussian policy, with the mean given by the network output and a fixed covariance. At the beginning of each episode, the agent samples a goal $g$ and an initial state $s_0$. At time step $t$, the agent takes an action $a_t$ sampled from the Gaussian policy conditioned on the concatenation $[s_t, g]$ of the current state and goal. The agent then moves to the next state $s_{t+1}$ according to the transition dynamics and receives the reward given by the discriminator. At the end of each episode, a trajectory $\tau = (s_0, a_0, \ldots, s_T, a_T)$ is generated, where $T$ is the length of the trajectory. Repeating the above procedure $N$ times, we obtain the set of rolled-out trajectories $\mathcal{T}$.
In order to train the discriminator without expert demonstrations, our method leverages the hindsight transformation technique (shown in Algorithm 1) to convert the rolled-out trajectories $\mathcal{T}$ into expert-like trajectories $\mathcal{T}'$.
More specifically, the detailed steps for self-synthesizing expert-like trajectories from rolled-out trajectories are as follows. First, for each trajectory $\tau$ in $\mathcal{T}$, each time step is chosen with probability $p$ for hindsight transformation, where $p \in [0, 1]$; all chosen time steps of $\tau$ are appended to a set $F$. Second, for every time step $t$ in $F$, the goal of state $s_t$ is replaced by the position achieved at state $s_k$, where $k$ is randomly chosen from the time steps $t$ to $T$ of $\tau$. In other words, the new goal of state $s_t$ is a position actually achieved at or after observing $s_t$. This transforms the rolled-out trajectory $\tau$ into an expert-like trajectory $\tau'$. The procedure is repeated until all trajectories are transformed.
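The future-style hindsight transformation described above can be sketched as follows. The dict-based trajectory representation and field names are assumptions for illustration, not the authors' data structures:

```python
import random

def hindsight_transform(trajectory, p=1.0, rng=random):
    """Convert a rolled-out trajectory into an expert-like one.

    `trajectory` is a list of dicts with keys 'state', 'action', 'goal',
    'achieved' (the position actually reached at that step). Each step is
    selected with probability `p`; its goal is replaced by the position
    achieved at a randomly chosen step at or after it, so the relabeled
    step looks like successful progress toward its (new) goal.
    """
    T = len(trajectory)
    transformed = [dict(step) for step in trajectory]  # copy; keep originals intact
    for t in range(T):
        if rng.random() < p:
            future = rng.randrange(t, T)  # a step index from t to T-1
            transformed[t]["goal"] = trajectory[future]["achieved"]
    return transformed
```

The untouched input list plays the role of the negative samples, and the returned list the role of the self-synthesized expert-like positive samples.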
The policy (generator) is optimized with the policy gradient method proximal policy optimization (PPO). The objective function is

$$\mathcal{L}(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\log\left(1 - D_w(s,a)\right)\right] - \lambda H(\pi_\theta). \quad (2)$$

The gradient is given by

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q(s,a)\right] - \lambda \nabla_\theta H(\pi_\theta), \quad (3)$$

where $Q(s,a)$ is the action-value function computed under the reward function $r(s,a)$ output by the discriminator.
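For concreteness, the per-sample PPO clipped surrogate that the generator ascends can be sketched as follows; the clipping threshold 0.2 is PPO's common default and is an assumption here, and in HGAIL the advantage is computed from discriminator-given rewards:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample.

    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` is estimated from the
    discriminator-given rewards. The objective is
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), which removes the
    incentive to move the policy ratio outside the trust region.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage the ratio gains are capped at 1 + eps; with a negative advantage the penalty is capped at 1 - eps, which is what keeps updates conservative.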
The discriminator is optimized by minimizing the cross entropy, i.e., by maximizing

$$\mathcal{L}(w) = \mathbb{E}_{(s,a)\sim \mathcal{T}'}\!\left[\log D_w(s,a)\right] + \mathbb{E}_{(s,a)\sim \mathcal{T}}\!\left[\log\left(1 - D_w(s,a)\right)\right]. \quad (4)$$

The gradient is given by

$$\nabla_w \mathcal{L}(w) = \mathbb{E}_{(s,a)\sim \mathcal{T}'}\!\left[\nabla_w \log D_w(s,a)\right] + \mathbb{E}_{(s,a)\sim \mathcal{T}}\!\left[\nabla_w \log\left(1 - D_w(s,a)\right)\right].$$
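A minimal sketch of the discriminator's cross-entropy loss, with rolled-out pairs as negatives and hindsight-transformed pairs as positives. The pure-Python logistic form is an illustrative stand-in for the neural-network training step:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def discriminator_loss(logits_fake, logits_expert):
    """Binary cross-entropy for the discriminator.

    `logits_fake` are discriminator outputs on rolled-out (negative) pairs,
    `logits_expert` on hindsight-transformed, expert-like (positive) pairs.
    Minimizing this loss pushes D -> 0 on negatives and D -> 1 on positives.
    """
    loss = 0.0
    for z in logits_fake:
        loss += -math.log(1.0 - sigmoid(z) + 1e-12)  # label 0
    for z in logits_expert:
        loss += -math.log(sigmoid(z) + 1e-12)        # label 1
    return loss / (len(logits_fake) + len(logits_expert))
```

A discriminator that already separates the two sample sets incurs a much smaller loss than one that confuses them.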
The fully detailed HGAIL algorithm is shown in Algorithm 2. At the beginning of training, we generate trajectories $\mathcal{T}$ using the policy with random weights. Expert-like demonstration data $\mathcal{T}'$ are synthesized from $\mathcal{T}$ with the hindsight transformation technique, as shown in Algorithm 1. $\mathcal{T}$ is regarded as the negative samples and $\mathcal{T}'$ as the positive samples. We use maximum likelihood estimation (MLE) to pre-train the policy on $\mathcal{T}'$. We also pre-train the discriminator by minimizing the cross entropy between $\mathcal{T}$ and $\mathcal{T}'$; we found this pre-training procedure beneficial for policy training. After pre-training, the policy and the discriminator are optimized by alternating between policy gradient steps that decrease (2) with respect to the policy parameter $\theta$ and gradient steps that increase (4) with respect to the discriminator parameter $w$, until both the policy and the discriminator converge.
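The alternating optimization of Algorithm 2 can be summarized structurally as follows. The four callables are placeholders (assumptions) for the environment rollout, the hindsight transformation of Algorithm 1, the PPO policy update, and the discriminator update; pre-training is omitted for brevity:

```python
def hgail_train(rollout_fn, transform_fn, update_policy, update_discriminator,
                iterations=100):
    """Skeleton of the HGAIL alternating optimization.

    rollout_fn()            -> trajectories from the current policy (negatives)
    transform_fn(neg)       -> self-synthesized expert-like trajectories (positives)
    update_discriminator    -> one cross-entropy step on (negatives, positives)
    update_policy           -> one PPO step; rewards come from the discriminator
    """
    for _ in range(iterations):
        negatives = rollout_fn()
        positives = transform_fn(negatives)
        update_discriminator(negatives, positives)
        update_policy(negatives)
```

Because the positives are regenerated from the current policy's rollouts at every iteration, the difficulty of the discrimination task tracks the policy's competence, which is the curriculum effect discussed above.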
IV Experiments and Results
In this section, our goal is to test whether policies learned via the proposed HGAIL method work well without external demonstrations. In addition, ablation studies are conducted to show the influence of different mechanisms and hyper-parameters on policy learning. Finally, experiments test whether the final learned policies can be directly transferred to a real-world physical system.
IV-A Policy Learning
To test the feasibility of the proposed HGAIL method, experiments are carried out on two common robot tasks in the gym environment: reaching a target position and grasping a target object (as shown in Figure 2). To make these tasks more challenging, we assume only a binary sparse reward is available. For the reaching task, the reward is -1 for most states and 0 only when the robot gripper reaches the target position. Similarly, for the grasping task, the reward is -1 for most states and 0 only when the gripper succeeds in grasping the target object.
We compare our proposed HGAIL algorithm against the following methods: (1) GAIL with demonstrations available, denoted GAIL-demo; (2) PPO, a state-of-the-art policy gradient method; (3) GASIL; and (4) HGAIL without the hindsight transformation technique, denoted HGAIL-no.
Two tasks implemented on the Fetch robot in gym. The Fetch robot has seven degrees of freedom. In our experiments, the robot takes a four-dimensional action vector as input: the first three elements move the end-effector (gripper) along three orthogonal directions, and the fourth controls whether the gripper is closed or open. Left: reaching task; the red point denotes the target position within the robot workspace, and the fourth element of the action vector is fixed. Right: grasping task; the black cube is the target object to be picked. Best viewed in color.
The performance of learned policies is measured in terms of two metrics: distance error and success rate. Distance error is the distance between the target position and the gripper position at the end of each episode.
Success rate specifies the ratio of episodes that successfully reach the target position within the allowed error (for the reaching task) or grasp the desired object (for the grasping task) to the total number of episodes:

$$\text{success rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\text{episode } i \text{ succeeds}\right], \quad (5)$$

where $\mathbb{1}[\cdot]$ is the indicator function, which outputs 1 when its argument is true and 0 when it is false.
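The success-rate metric can be computed as follows; the 1 cm tolerance `eps` is an illustrative assumption, since the paper's exact threshold is not specified here:

```python
def success_rate(final_distances, eps=0.01):
    """Fraction of episodes whose final gripper-to-target distance is within
    the allowed error `eps` (here 0.01 m, an assumed threshold)."""
    indicator = [1 if d <= eps else 0 for d in final_distances]
    return sum(indicator) / len(indicator)
```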
Implementation details are available in Appendix VI-A. Learning curves for policies learned with HGAIL and the methods above on the robot reaching and grasping tasks are shown in Figure 3, and Table 1 summarizes the performance of the final learned policies.
Compared to GAIL with demonstrations available, our HGAIL algorithm shows comparable performance in terms of success rate and final distance error on both the reaching and grasping tasks, although the policies trained with our method on the grasping task converge more slowly. Given that no demonstration data are used, the performance of policies trained with HGAIL is promising. Compared to PPO, our algorithm performs much better: although the policy component of our algorithm is optimized via PPO, outside of the HGAIL learning framework PPO alone cannot train successful policies for tasks with binary sparse rewards. Policies trained with GASIL optimize more slowly and perform worse on both tasks. Compared with HGAIL-no, HGAIL exhibits better performance, indicating that the hindsight transformation is a crucial ingredient of the proposed algorithm when demonstrations are unavailable. These results show that HGAIL can work well without demonstration data and learn successful policies. We can also see that, because our algorithm essentially embeds a curriculum learning mechanism, HGAIL optimizes faster than GAIL with demonstrations during the early period of training.
| Method | Reaching task: success rate | Reaching task: distance error | Grasping task: success rate | Grasping task: distance error |
IV-B Ablation Studies
In our experiments, ablation studies on the reaching task and the grasping task lead to similar conclusions. Consequently, to keep the paper concise and compact, we mainly report results on the reaching task in this section.
IV-B1 Curriculum Learning or Not
In the HGAIL learning framework, hindsight-transformed (expert-like) data are converted from rollouts produced by generators at various skill levels over the course of adversarial learning. To some extent, the HGAIL learning paradigm thus endows policy training with a curriculum learning mechanism.
To examine whether this curriculum learning mechanism is crucial for policy training, we conduct experiments in which policies are trained without it. Concretely, in these ablation experiments, the hindsight-transformed (expert-like) data are produced only from rolled-out trajectories generated by the policy at the beginning of training. Learning curves for success rate and distance error are shown in Figure 4, and Table 2 summarizes the performance of the final trained policies. As illustrated, the policy trained with the curriculum learning mechanism performs better on both success rate and distance error, and its learning process is more stable.
| Method | Success rate | Distance error |
| No curriculum learning |
IV-B2 Formation of Hindsight Transformation
Inspired by HER, we propose two strategies for hindsight transformation, called final hindsight transformation and future hindsight transformation. Final hindsight transformation replaces the goal of each state with the position of the final state reached in its own episode. In contrast, future hindsight transformation randomly replaces the goal of each state with the position of a state observed after it, as shown in Algorithm 1.
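The final-style strategy can be sketched for contrast; the dict-based trajectory representation is an illustrative assumption:

```python
def final_hindsight(trajectory):
    """'Final' strategy: every step's goal becomes the position achieved at
    the last step of the episode, rather than at a randomly sampled future
    step as in the 'future' strategy of Algorithm 1."""
    final_achieved = trajectory[-1]["achieved"]
    return [dict(step, goal=final_achieved) for step in trajectory]
```

Because every step is relabeled toward the same terminal position, the resulting positives are far less diverse than under the future strategy, which may help explain the divergence reported below.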
Learning curves for the two hindsight transformations are shown in Figure 5, and Table 3 summarizes the final policy performance. Final hindsight transformation does not work well: the policy learned with it gradually diverges during training.
| Hindsight transformation | Success rate | Distance error |
IV-B3 Hindsight Transformation Probability
So far, all reported policies were trained with the default hindsight transformation probability. We are also interested in the effect of this value on the performance of the final learned policy. Experiments are carried out with the probability set to 0.2, 0.4, 0.6, 0.8, and 1. Learning curves are shown in Figure 6. The results illustrate that applying the hindsight transformation to each state with probability 1 performs best, which differs from HER. The larger the hindsight transformation probability, the better the final learned policy performs.
IV-B4 Reward Formation
Different reward functions for policy learning in the HGAIL framework are analyzed experimentally. In the GAIL literature, several reward formations derived from the discriminator output have been applied. We compare four common reward functions, where $\sigma$ denotes the sigmoid function, the discriminator takes state-action pairs as input, and one variant clips the output to a fixed range. The results are illustrated in Figure 7. One reward formation converges fastest among the four, and three of the four guide the final policies to similarly low distance errors, clearly better than the remaining one. The best-performing reward function dominates with respect to both the number of iterations consumed for policy training and the final success rates and distance errors. As a result, we choose it as our default reward function for policy learning.
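The four candidate shapes are not fully legible in this copy; the sketch below lists the standard candidates from the GAIL literature (an assumption, not necessarily the paper's exact four), written assuming the discriminator is trained toward 1 on expert-like pairs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# D = sigmoid(logit) is the discriminator's probability that a state-action
# pair is expert-like. Four common reward shapings built from it:

def r_nonsat(logit):
    """-log(1 - D): unbounded positive reward for fooling the discriminator."""
    return -math.log(max(1.0 - sigmoid(logit), 1e-12))

def r_log(logit):
    """log D: always negative, acting as a survival penalty."""
    return math.log(max(sigmoid(logit), 1e-12))

def r_diff(logit):
    """log D - log(1 - D): equals the raw logit (AIRL-style reward)."""
    return r_log(logit) + r_nonsat(logit)

def r_clip(logit, lo=-10.0, hi=10.0):
    """Raw logit clipped to a fixed range (the bounds are assumptions)."""
    return min(max(logit, lo), hi)
```

The sign and boundedness of the reward shape the learned behavior: an always-positive reward encourages long episodes, an always-negative one encourages finishing quickly, which is why the choice matters in sparse-reward manipulation tasks.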
IV-C Sim-to-Real Policy Transfer
To validate that a policy trained with our algorithm can be deployed in a real-world physical system without additional training, experiments are conducted on a real-world UR5 robot (the only robot arm available in our lab). Detailed implementation of these experiments is given in Appendix VI-B. As shown in Figure 8, the red ball marks the target position for the reaching task, and the pink cube is the target object to be grasped for the grasping task. Frames of the UR5 robot executing the learned policies for reaching the target position and grasping the target object are also pictured in Figure 8. Success rates and distance errors are summarized in Table 4. The results show that policies learned with HGAIL transfer successfully from the simulated environment to real-world scenarios, and their real-world performance is consistent with the simulated environment without additional training.
| Task | Success rate | Distance error |
We propose the HGAIL algorithm, a new learning paradigm under the GAIL framework for learning control policies without expert demonstrations. We adopt a hindsight transformation mechanism to self-synthesize expert-like demonstration data for adversarial policy learning. Experimental results show that the proposed method trains policies efficiently and effectively. In addition, the hindsight transformation technique essentially endows our framework with a curriculum learning mechanism that is critical for policy learning. We also validate that policies trained with our algorithm can be deployed directly on a real-world robot without additional training.
In the future, we want to apply our method to more continuous and discrete environments. A promising direction is directly training manipulation skills on a real-world robot, since the amount of interaction data required is relatively small. Another exciting direction is to combine the HGAIL algorithm with hierarchical methods to solve more complicated tasks.
-  Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning, Nature, vol. 518, no. 7540, pp. 529-533, 2015.
-  Andrychowicz M, Baker B, Chociej M, et al. Learning dexterous in-hand manipulation, arXiv preprint arXiv: 1808.00177, 2018.
-  Levine S, Finn C, Darrell T, et al. End-to-end training of deep visuomotor policies, The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334-1373, 2016.
-  Kuderer M, Gulati S, Burgard W. Learning driving styles for autonomous vehicles from demonstration, IEEE International Conference on Robotics and Automation. 2015: 2641-2646.
-  Qureshi A H, Yip M C. Adversarial Imitation via Variational Inverse Reinforcement Learning, International Conference on Learning Representations, 2019.
-  Ng A, Russell S. Algorithms for inverse reinforcement learning, International Conference on Machine Learning. 2000: 663-670.
-  Argall B D, Chernova S, Veloso M, et al. A survey of robot learning from demonstration, Robotics and autonomous systems, vol. 57, no. 5, pp. 469-483, 2009.
-  Nejati N, Langley P, Konik T. Learning hierarchical task networks by observation, International Conference on Machine Learning. 2006: 665-672.
-  Ross S, Gordon G, Bagnell D. A reduction of imitation learning and structured prediction to no-regret online learning, In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011: 627-635.
-  Ho J, Ermon S. Generative adversarial imitation learning, In Advances in Neural Information Processing Systems. 2016: 4565-4573.
-  Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets, In Advances in Neural Information Processing Systems. 2014: 2672-2680.
-  Zhang T, McCarthy Z, Jowl O, et al. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation, IEEE International Conference on Robotics and Automation. 2018: 5628-5635.
-  Duan Y, Andrychowicz M, Stadie B, et al. One-shot imitation learning, In Advances in Neural Information Processing Systems. 2017: 1087-1098.
-  Fischhoff B. Hindsight is not equal to foresight: The effect of outcome knowledge on judgment under uncertainty, Journal of Experimental Psychology: Human Perception and Performance, vol. 1, no. 3, pp. 288, 1975.
-  Andrychowicz M, Wolski F, Ray A, et al. Hindsight experience replay, In Advances in Neural Information Processing Systems. 2017: 5048-5058.
-  Abbeel P, Ng A. Apprenticeship learning via inverse reinforcement learning, International Conference on Machine Learning. 2004: 1.
-  Ratliff N D, Bagnell J A, Zinkevich M A. Maximum margin planning, International Conference on Machine Learning. 2006: 729-736.
-  Ziebart B D, Maas A L, Bagnell J A, et al. Maximum entropy inverse reinforcement learning, In Twenty-Third AAAI Conference on Artificial Intelligence. 2008: 1433-1438.
-  Bloem M, Bambos N. Infinite time horizon maximum causal entropy inverse reinforcement learning, IEEE Conference on Decision and Control. 2014: 4911-4916.
-  Boularias A, Kober J, Peters J. Relative entropy inverse reinforcement learning, In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011: 182-189.
-  Finn C, Levine S, Abbeel P. Guided cost learning: Deep inverse optimal control via policy optimization, International Conference on Machine Learning. 2016: 49-58.
-  Wang Z, Merel J S, Reed S E, et al. Robust imitation of diverse behaviors, In Advances in Neural Information Processing Systems. 2017: 5320-5329.
-  Kim K E, Park H S. Imitation learning via kernel mean embedding, In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
-  Peng X B, Kanazawa A, Toyer S, et al. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow, arXiv preprint arXiv: 1810.00821, 2018.
-  Fu J, Luo K, Levine S. Learning robust rewards with adversarial inverse reinforcement learning, International Conference on Learning Representations, 2017.
-  Qureshi A H, Boots B, Yip M C. Adversarial imitation via variational inverse reinforcement learning, International Conference on Learning Representations, 2019.
-  Torabi F, Warnell G, Stone P. Generative adversarial imitation from observation, arXiv preprint arXiv: 1807.06158, 2018.
-  Li Y, Song J, Ermon S. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems. 2017: 3812-3822.
-  Lanka S, Wu T. ARCHER: Aggressive rewards to counter bias in hindsight experience replay, arXiv preprint arXiv: 1809.02070, 2018.
-  Zhao R, Tresp V. Curiosity-driven experience prioritization via density estimation, arXiv preprint arXiv: 1902.08039, 2019.
-  Zhao R, Tresp V. Energy-based hindsight experience prioritization, In 2nd Conference on Robot Learning, 2018.
-  Fang M, Zhou C, Shi B, et al. DHER: Hindsight experience replay for dynamic goals, International Conference on Learning Representations, 2019.
-  Finn C, Yu T, Zhang T, et al. One-shot visual imitation learning via meta-learning, In 1st Conference on Robot Learning. 2017: 357-368.
-  Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks, In Advances in Neural Information Processing Systems. 2017: 1126-1135.
-  Yu T, Finn C, Xie A, et al. One-shot imitation from observing humans via domain-adaptive meta-learning, arXiv preprint arXiv: 1802.01557, 2018.
-  Finn C, Yu T, Zhang T, et al. One-Shot Visual Imitation Learning via Meta-Learning, In Conference on Robot Learning. 2017: 357-368.
-  Xu D, Nair S, Zhu Y, et al. Neural task programming: Learning to generalize across hierarchical tasks, IEEE International Conference on Robotics and Automation. 2018: 3795-3802.
-  Wang Z, Merel J S, Reed S E, et al. Robust imitation of diverse behaviors, In Advances in Neural Information Processing Systems. 2017: 5320-5329.
-  Zhu Y, Wang Z, Merel J S, et al. Reinforcement and imitation learning for diverse visuomotor skills, arXiv preprint arXiv: 1802.09564, 2018.
-  Oh J, Guo Y, Singh S, et al. Self-imitation learning, In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.
-  Gangwani T, Liu Q, Peng J. Learning self-imitating diverse policies, International Conference on Learning Representations, 2019.
-  Guo Y, Oh J, Singh S, et al. Generative adversarial self-imitation learning, arXiv preprint arXiv: 1812.00950, 2018.
-  Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms, arXiv preprint arXiv: 1707.06347, 2017.
-  Plappert M, Andrychowicz M, Ray A, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research, arXiv preprint arXiv: 1802.09464, 2018.
-  Brockman G, Cheung V, Pettersson L, et al. Openai gym, arXiv preprint arXiv: 1606.01540, 2016.
-  Schulman J, Levine S, Abbeel P, et al. Trust Region Policy Optimization, International Conference on Machine Learning. 2015: 1889-1897.
-  Hester T, Vecerik M, Pietquin O, et al. Deep Q-learning from demonstrations, In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
-  Che T, Li Y, Jacob A P, et al. Mode regularized generative adversarial networks, International Conference on Learning Representations, 2017.
VI-A Implementation Details
In this section, we provide additional details about the experimental tasks setup and hyper-parameters.
We use two-layer tanh neural networks with 64 units per layer for the value network and the policy network. The policy network takes as input a concatenated vector of gripper position, gripper velocity, and target position. The policy network's output parameterizes the Gaussian policy distribution, where the mean is the output of the policy network and the covariance is fixed to 1.

We use a two-layer tanh neural network with 100 units in each layer for the discriminator.
We set , . The learning rate for the discriminator is 0.0004, and the learning rate for the generator is . Batch size is 64 for discriminator optimization and 128 for generator optimization. Pre-training runs for 100 steps for the generator and 500 steps for the discriminator. For a fair comparison, all experiments were run in a single thread; all algorithms (HGAIL, GAIL-demo, GASIL, and HGAIL-no) share the same network architecture and hyper-parameters, and PPO shares these parameters with the generator.
It should be mentioned that, unless clearly indicated otherwise in the paper, all ablation-study parameters are set to the default values for the hindsight transformation probability and the reward function.
VI-B Transfer to Real-World Robot
The final learned policy is directly transferred from the simulated environment to the real-world UR5 robot without additional training. As shown in Figure 8, we use a distinct object to define the target position. In our working scenario, an RGB-D image is obtained from a depth camera installed above the robot. A separately trained deep neural network (VGG-16) outputs the object's pixel position $(u, v)$. The target object position in the robot coordinate system is obtained by the following equation:

$$p_{\text{robot}} = R\left(d\, K^{-1} [u, v, 1]^{\top}\right) + t,$$

where $K$ is the camera intrinsic matrix, $d$ is the depth value at pixel $(u, v)$, and $R$ and $t$ are the rotation matrix and translation vector from the camera coordinate system to the robot coordinate system, respectively. At each time step, the gripper's position, the gripper's velocity, and the target object position are concatenated into a single vector and fed into the policy network, similar to the training of the Fetch arm in the simulated environment. The mean of the Gaussian policy output is sent to the robot controller, and the UR5 gripper moves to the next position. The above procedure repeats until the end of the episode.
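The back-projection from pixel to robot coordinates can be sketched as follows; `K_inv`, `R`, and `t` come from camera calibration and hand-eye calibration, and their values are assumptions not given in the paper:

```python
def pixel_to_robot(u, v, depth, K_inv, R, t):
    """Back-project pixel (u, v) with measured depth into the robot frame.

    p_cam   = depth * K_inv @ [u, v, 1]   (camera frame)
    p_robot = R @ p_cam + t               (robot frame)

    K_inv and R are 3x3 nested lists, t is a length-3 list; all are assumed
    to come from prior calibration.
    """
    ray = [u, v, 1.0]
    p_cam = [depth * sum(K_inv[i][j] * ray[j] for j in range(3)) for i in range(3)]
    return [sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)]
```

With an identity intrinsic inverse and an identity extrinsic transform, the result reduces to the scaled pixel ray, which gives a quick sanity check of the matrix order.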