In this work we consider visually guided robotic manipulation and aim to learn visuomotor control policies that solve particular tasks. One approach to this problem is imitation learning (IL) [1, 2, 3, 4], which aims to mimic sequences of actions from expert demonstrations. Such a supervised approach is efficient for learning short and simple tasks of limited variability. One drawback of IL is its difficulty in handling new states that have not been observed during demonstrations. While increasing the number of demonstrations helps to alleviate this issue, an exhaustive sampling of action sequences and scenarios becomes impractical for long and complex tasks.
Reinforcement learning (RL) is a complementary approach to IL that requires little supervision and achieves excellent results for some challenging tasks [5, 6]. RL explores previously unseen scenarios and, hence, can generalize beyond expert demonstrations. As full exploration is exponentially hard and becomes impractical for problems with long horizons, RL often relies on careful engineering of intermediate reward functions designed for specific tasks.
Common tasks such as preparing food or assembling furniture require long sequences of steps composed of many different actions. Such tasks have long horizons and, hence, are difficult to solve with either RL or IL methods. To address this issue, we propose a hierarchy of RL and imitation-based skills. Our approach aims to simplify RL by reducing its exploration to sequences with a limited number of primitive actions, which we call skills.
Given a set of pre-trained skills such as "grasp a cube" or "pour from a cup", we train RL with sparse binary rewards corresponding to the correct/incorrect execution of the full task. While hierarchical policies have been proposed in the past [7, 8], our approach can learn composite manipulations using no intermediate rewards and no demonstrations of full tasks. Given such properties, the proposed method can be directly applied to learn new tasks. See Figure 1 for the overview of our approach.
Our skills are low-level visuomotor controllers learned from synthetic demonstrated trajectories with behavioral cloning (BC). Examples of skills include "go to the bowl", "grasp the object", "pour from the held object" and "release the held object".
We automatically generate expert synthetic demonstrations and learn corresponding skills in simulated environments. We also minimize the number of required demonstrations by choosing appropriate CNN architectures and data augmentation methods. Our approach is shown to compare favorably to the state of the art on the FetchPickPlace test environment. Moreover, using recent techniques for domain adaptation, we demonstrate high success rates for task execution on a real robot while training all policies in a simulator.
In summary, this work makes the following contributions. (i) We propose a new hierarchical combination of RL policies and IL skills to address composite tasks. (ii) We present sample-efficient training of BC skills and demonstrate an improvement compared to the state of the art. (iii) We demonstrate successful learning of relatively complex manipulation tasks with neither intermediate rewards nor full demonstrations. (iv) We execute policies learned in simulation to solve multi-step tasks successfully on a real robot. Our simulation environments together with the code and models used in this work will become publicly available.
II Related work
Our work is related to robotic manipulation tasks such as grasping, opening doors, screwing the cap of a bottle and cube stacking. Such tasks have been addressed by various methods including imitation learning (IL) and reinforcement learning (RL).
Imitation Learning learns to solve a task by observing demonstrations. Approaches include behavioral cloning (BC) and inverse reinforcement learning. BC learns a function that maps states to expert actions [2, 3], whereas inverse reinforcement learning learns a reward function from demonstrations in order to solve the task with RL [17, 18, 14]. BC typically requires a large number of demonstrations and has issues with solving long-horizon problems. While these problems might be solved with additional expert supervision or noise injection in expert demonstrations, we address them by improving the standard BC framework. We use recent state-of-the-art CNN architectures and data augmentation for expert trajectories. Combining these allows us to significantly reduce the number of required demonstrations and to improve performance.
Reinforcement Learning learns to solve a task without demonstrations using exploration. Despite impressive results in several domains [6, 5, 20, 12], RL methods show limited capabilities when operating in the long-horizon and sparse-reward environments common in robotics. Moreover, RL methods typically require prohibitively large amounts of interaction with the environment during training. Hierarchical RL (HRL) methods alleviate some of these problems by learning a high-level policy modulating low-level workers. HRL approaches are generally based either on options or on a feudal framework. The option methods learn a master policy that switches between separate skill policies [23, 24, 25, 26]. The feudal approaches learn a master policy that modulates a low-level policy by a control signal [27, 28, 29, 30, 31]. Our approach is based on options but, in contrast to the cited methods, we pretrain the skills with IL. This allows us to solve long-horizon and sparse-reward problems using significantly fewer interactions with the environment during training.
Combinations of RL and IL have been introduced recently. Gao et al. use demonstrations to initialize the RL agent. [33, 34] use RL to improve expert demonstrations, but do not learn hierarchical policies. Demonstrations have also been used to define RL objective functions [35, 36] and rewards. Das et al. combine IL and RL to learn a hierarchical policy. Unlike our method, however, their approach requires full task demonstrations and task-specific reward engineering. Moreover, the navigation problem they address has a much shorter time horizon than our tasks. Their method also relies on pre-trained CNN representations, which limits its application domain. Le et al. train low-level skills with RL, while using demonstrations to switch between skills. In a reverse manner, we use IL to learn low-level control and then deploy RL to find appropriate sequences of pre-trained skills. The advantage is that our method can learn a variety of long-horizon manipulations without full task demonstrations. Moreover, [7, 8] learn discrete actions and cannot be directly applied to robotic manipulations that require continuous control.
In summary, none of the methods [7, 8, 33, 34] is directly suitable for learning long-horizon robotic manipulations due to requirements of dense rewards [7, 34] and state inputs [33, 34], limitations to short horizons and discrete actions [7, 8], the requirement of full task demonstrations [7, 8, 33, 34] and the lack of learning of visual representations [7, 33, 34]. Moreover, our skills learned from synthetic demonstrated trajectories outperform RL-based methods, see Section IV-F.
Our HRL-BC approach aims to learn multi-step policies by combining hierarchical reinforcement learning (HRL) and pre-trained skills obtained with behavioral cloning (BC). We present BC and HRL-BC in Sections III-A and III-B. Implementation details are given in Section III-C.
III-A Skill learning with behavioral cloning
Our first goal is to learn basic skills that can be composed into more complex policies. Given observation-action pairs along expert trajectories, we follow the behavioral cloning approach and learn a function approximating the conditional distribution of the expert policy controlling a robot arm. Our observations are sequences of the most recent depth frames. Actions are defined by the end-effector linear velocity and angular velocity as well as the gripper openness state.
We learn deterministic skill policies approximating the expert policy. Given observations with corresponding expert (ground truth) actions, we represent each skill policy with a convolutional neural network (CNN) and learn network parameters such that the predicted actions minimize a weighted loss, where the scaling factor is set to 0.9 (chosen empirically).
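As an illustration only, the sketch below shows one plausible shape for such a weighted loss: a squared error on the velocity components combined with a binary cross-entropy on the gripper state, balanced by the scaling factor 0.9. The split into these two terms, the choice of which term the factor multiplies, and all names (`bc_loss`, `pred_velocity`, `pred_gripper_prob`) are assumptions made for this example and are not taken from the text.

```python
import math

def bc_loss(pred_velocity, expert_velocity, pred_gripper_prob,
            expert_gripper_open, scale=0.9):
    """Hypothetical BC loss: scale * velocity MSE + (1 - scale) * gripper BCE."""
    # Mean squared error over the linear and angular velocity components.
    mse = sum((p - e) ** 2 for p, e in zip(pred_velocity, expert_velocity))
    mse /= len(expert_velocity)
    # Binary cross-entropy on the predicted probability that the gripper is open.
    eps = 1e-12
    p = min(max(pred_gripper_prob, eps), 1.0 - eps)
    bce = -math.log(p) if expert_gripper_open else -math.log(1.0 - p)
    return scale * mse + (1.0 - scale) * bce

# A prediction that is far off in velocity and undecided about the gripper.
example_loss = bc_loss([0.0] * 6, [1.0] * 6, 0.5, True)
```

With a factor of 0.9 the velocity error dominates the objective in this sketch, so the gripper term acts as a smaller correction.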
Our network architecture is presented in Figure 1(right). When training several skill policies such as reaching, grasping or pouring, we share the parameters of the common convolutional layers and add a separate branch with convolutional layers for each skill. We use the same architecture to learn the master policy as explained in Section III-B.
III-B HRL-BC approach
We wish to solve composite manipulations without full expert demonstrations and with a single sparse reward. For this purpose we design high-level master policies controlling the pre-trained skill policies at a coarse timescale. To learn the master policy, we follow the standard formulation of reinforcement learning and maximize the expected return given the rewards. Our reward function is sparse and returns a positive value only upon successful termination of the task. The RL master policy chooses one of the skill policies to execute the low-level control, i.e., the action space of the master policy is discrete and indexes the available skills. Note that our sparse reward function makes the learning of deep visual representations challenging. We therefore train the master policy using visual features obtained from the BC-pretrained convolutional layers, as illustrated in Figure 1(right).
To solve composite tasks with sparse rewards, we use a coarse timescale for the master policy. The selected skill policy controls the robot for a fixed number of consecutive time-steps before the master policy is activated again to choose a new skill. This allows the master to focus on high-level decisions rather than low-level control. At the same time, we expect the master policy to recover from unexpected events, for example, if an object slips out of a gripper, by activating an appropriate skill policy. Our hierarchical combination of the master and skill policies is illustrated in Figure 1(left).
The pseudo-code of the proposed approach is shown in Algorithm 1. The algorithm can be divided into three main steps. First, we collect a dataset of expert trajectories for each skill policy. For each policy, we use an expert script that has access to the full state of the environment. Next, we train the set of skill policies. We sample a batch of state-action pairs and update the parameters of the shared convolutional layers as well as the skill-specific parameters of each skill. Finally, we learn the master policy using the pretrained skill policies and the frozen shared parameters. We collect episode rollouts by first choosing a skill policy with the master and then applying the selected skill to the environment for a fixed number of time-steps (lines 13-14). We update the master policy weights by maximizing the expected sum of rewards.
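The rollout step of the third stage can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function `run_episode`, the toy one-dimensional environment and the constant master are all hypothetical stand-ins for the learned policies and the simulator.

```python
def run_episode(master, skills, env_step, initial_obs, skill_duration, max_steps):
    """Roll out a master policy that selects one skill every `skill_duration` steps."""
    obs, total_reward, steps, chosen = initial_obs, 0.0, 0, []
    while steps < max_steps:
        skill_id = master(obs)            # discrete high-level decision
        chosen.append(skill_id)
        for _ in range(skill_duration):   # the selected skill controls the robot
            action = skills[skill_id](obs)
            obs, reward, done = env_step(obs, action)
            total_reward += reward
            steps += 1
            if done or steps >= max_steps:
                return total_reward, steps, chosen
    return total_reward, steps, chosen

# Toy stand-ins (assumptions): the state is an integer, the task succeeds at 5.
def toy_env_step(state, action):
    next_state = state + action
    done = next_state == 5
    return next_state, (1.0 if done else 0.0), done

toy_skills = [lambda obs: 1, lambda obs: -1]   # "move right", "move left"
always_right = lambda obs: 0
result = run_episode(always_right, toy_skills, toy_env_step, 0,
                     skill_duration=2, max_steps=20)
```

The sparse reward is only observed at the end of a successful episode, which is why the master in this sketch receives no feedback while a skill is running.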
III-C Implementation details
Skill learning with BC. While training the skills, we predict several actions in the future to improve the policy performance and to stabilize the training. We learn to predict actions at multiple future time-steps during training and use a single predicted action to control the robot at test time. We normalize the ground-truth expert actions to have zero mean and unit variance and normalize the depth values of input frames to a fixed range. For the optimization, we use Adam with a batch size of 64. In all cases we use Batch Normalization.
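The action normalization can be illustrated with a short sketch. It assumes that the statistics are computed per action dimension over the training set and that predictions are mapped back to the original scale before being sent to the robot; the helper names are hypothetical.

```python
import statistics

def fit_action_normalizer(expert_actions):
    """Per-dimension mean and standard deviation of the expert actions."""
    dims = list(zip(*expert_actions))
    means = [statistics.fmean(d) for d in dims]
    # Guard against constant dimensions, which would otherwise divide by zero.
    stds = [statistics.pstdev(d) or 1.0 for d in dims]
    return means, stds

def normalize(action, means, stds):
    return [(a - m) / s for a, m, s in zip(action, means, stds)]

def denormalize(action, means, stds):
    return [a * s + m for a, m, s in zip(action, means, stds)]

# Two toy 2-D actions; the second dimension is constant.
means, stds = fit_action_normalizer([[0.0, 2.0], [2.0, 2.0]])
normalized = normalize([2.0, 2.0], means, stds)
```

Training on normalized targets keeps the velocity components on a comparable scale, and `denormalize` recovers commands in physical units at test time.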
Task learning with RL. We use PPO as an off-the-shelf RL optimization algorithm for the master policy. We use an open-source implementation in which we set the entropy coefficient to 0.05 and the value loss coefficient to 1. For the two environments considered in Section V, we use 8 episode rollouts for the PPO update. The rest of the PPO hyperparameters are set to the defaults of the open-source implementation. For the HRL-BC method, the master and skill policies each have their own CNN layers on top of the common convolutional layers. During pre-training of the skill policies we update the common and skill-specific parameters. When training the master policy, we update only the master-specific parameters while keeping all other parameters fixed.
IV Evaluation of BC skill learning
IV-A Experimental setup
Robot and agent environment. Existing environments used for evaluating learning methods in robotics scenarios are limited. Here, we design a set of tasks with the pybullet physics simulator. Our environment models a 6-DoF UR5 robotic arm and a 3-finger Robotiq gripper. The agent observes the environment from a depth camera in front of the arm, takes as input the three last depth frames and commands the robot with an action. The control is performed at a frequency of 10 Hz.
UR5 tasks. For the skill learning we rely on the tasks of picking up a cube (UR5-Pick) and pouring from a bottle into a bowl (UR5-Pour). In the UR5-Pick task, the robot must pick up a cube with a size between 3.5 cm and 8.0 cm and lift it above 7 cm, see Figure 2(a). The maximum episode length is 200 steps. The goal of the UR5-Pour task is to pour from a filled bottle into an empty bowl, see Figure 2(b). The maximum episode length is 400 steps. In each episode, the bottle and the bowl are randomly chosen from the ShapeNet dataset. We use distinct object instances for the training and test sets.
Synthetic dataset. We use our simulator environment to create synthetic training and test sets. For all our experiments, we collect expert synthetic demonstrated trajectories with random initial configurations in which the objects and the end-effector are placed within a bounded workspace. The synthetic demonstrations are collected using an expert script designed for each skill. The script has access to the full state of the system including the states of the robot and the objects. To generate synthetic demonstrations, we program end-effector trajectories and use inverse kinematics (IK) to generate corresponding trajectories in the robot joint space. For example, to pick up a cube, we draw a line from the robot initial position to a position above the cube, go down, close the gripper and finally go up. All training demonstrations are collected in our simulator. Each demonstration consists of multiple pairs of the three last camera observations and the robot control command performed by the expert script. For the experiments with the UR5 tasks, we use a reference viewpoint located 140 cm from the robot base at a fixed pitch angle.
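The cube-picking script described above can be sketched as a list of end-effector waypoints. This is a simplified illustration under assumptions: the function names, the hover and lift heights and the linear interpolation are hypothetical, and the inverse-kinematics step that maps the waypoints to joint trajectories is omitted.

```python
def pick_waypoints(start, cube, hover_height, lift_height):
    """Return (x, y, z, gripper_open) waypoints for a scripted cube pick."""
    cx, cy, cz = cube
    return [
        (*start, True),                      # initial pose, gripper open
        (cx, cy, cz + hover_height, True),   # straight line to a pose above the cube
        (cx, cy, cz, True),                  # go down
        (cx, cy, cz, False),                 # close the gripper
        (cx, cy, cz + lift_height, False),   # go up with the cube
    ]

def interpolate(p, q, n):
    """n evenly spaced positions on the segment from p (excluded) to q (included)."""
    return [tuple(a + (b - a) * k / n for a, b in zip(p, q)) for k in range(1, n + 1)]

# Illustrative values only (metres); they are not taken from the text.
waypoints = pick_waypoints((0.0, 0.0, 0.3), (0.2, 0.1, 0.0),
                           hover_height=0.1, lift_height=0.1)
segment = interpolate((0.0, 0.0, 0.0), (1.0, 2.0, 4.0), 4)
```

Each interpolated position would then be passed to an IK solver, and the resulting command stored together with the camera observations as one training pair.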
Training and evaluation. For each dataset, we train the policy for 200 epochs. During training, we evaluate the policy every 4 epochs by running it on a set of 100 validation configurations and measure its success rate. We pick the policy with the best success rate on the validation set. For testing we use 50 new configurations and report the success rate of completing the task correctly for these 50 configurations.
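The checkpoint selection protocol can be written as a short loop. The sketch below assumes hypothetical callables `train_one_epoch` and `evaluate` standing in for the actual training step and the validation rollouts; only the schedule (200 epochs, evaluation every 4 epochs, keep the best) follows the text.

```python
def select_best_checkpoint(train_one_epoch, evaluate, num_epochs=200, eval_every=4):
    """Train for `num_epochs` and keep the policy with the best validation success rate."""
    best_rate, best_epoch, best_policy = -1.0, None, None
    for epoch in range(1, num_epochs + 1):
        policy = train_one_epoch(epoch)
        if epoch % eval_every == 0:
            rate = evaluate(policy)
            if rate > best_rate:
                best_rate, best_epoch, best_policy = rate, epoch, policy
    return best_policy, best_epoch, best_rate

def success_rate(outcomes):
    """Fraction of successful trials in a list of booleans."""
    return sum(outcomes) / len(outcomes)

# Toy stand-ins: the "policy" is the epoch index and validation peaks at epoch 100.
best = select_best_checkpoint(lambda epoch: epoch,
                              lambda policy: 1.0 - abs(policy - 100) / 200)
```

Selecting on a separate validation set keeps the 50 test configurations untouched until the final report.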
IV-B CNN architecture for BC skill learning
Given the synthetic training set of the UR5-Pick task, see Figure 2(a) (left), we train policies with different CNN architectures. Table I compares the success rates of learned policies for different CNN architectures and varying numbers of expert demonstrations on the synthetic test set. For BC skill policies, we here predict the action directly from the CNN output.
Policies based on the VGG CNN architecture have a low success rate on the smaller training sets and improve as the number of demonstrations grows (see Table I for the exact figures).
ResNet-based policies start from a higher success rate and also improve as demonstrations are added. Overall, ResNet-101 performs best, closely followed by ResNet-18, and both significantly outperform VGG. To conclude, we find that the network architecture has a fundamental impact on the BC performance. In the following experiments we use ResNet-18 as it presents a good trade-off between performance and training time.
When examining why VGG-based BC has a lower success rate, we observe that it has higher validation errors compared to ResNet. This indicates that VGG performs worse on the level of individual steps and is hence expected to result in higher compounding errors. We find that architectures based on ResNet suffer from the same problem only when a small number of demonstrations is used (less than 100).
IV-C Evaluation of data augmentation
We evaluate the impact of different types of data augmentation in Table II. We compare training without data augmentation with 3 variants: (1) random translations, rotations and crops (the standard approach for object detection), (2) recording each expert synthetic demonstration from 10 varying viewpoints and (3) the combination of (1) and (2). For (2), we sample the camera positions on a section of a sphere centered on the robot, drawing the yaw angle, the pitch angle and the distance to the robot base uniformly from fixed ranges.
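Viewpoint sampling of this kind can be sketched with a spherical-to-Cartesian conversion. The ranges used in the usage lines are placeholders chosen for illustration, not the values used in the experiments, and the function name and the convention that the robot base sits at the origin are assumptions.

```python
import math
import random

def sample_camera_position(rng, yaw_range, pitch_range, distance_range):
    """Sample a camera position on a section of a sphere centered on the robot base."""
    yaw = rng.uniform(*yaw_range)            # angle around the vertical axis (radians)
    pitch = rng.uniform(*pitch_range)        # elevation above the table plane (radians)
    distance = rng.uniform(*distance_range)  # distance to the robot base
    x = distance * math.cos(pitch) * math.cos(yaw)
    y = distance * math.cos(pitch) * math.sin(yaw)
    z = distance * math.sin(pitch)
    return (x, y, z)

# Placeholder ranges (assumptions): 10 viewpoints for one demonstration.
rng = random.Random(0)
viewpoints = [sample_camera_position(rng, (-0.5, 0.5), (0.2, 0.6), (1.3, 1.5))
              for _ in range(10)]
```

Rendering the same demonstration from each sampled position multiplies the number of training images without requiring additional expert trajectories.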
Success rates for UR5-Pick on datasets with 20, 50 and 100 demonstrations are reported in Table II. We observe that data augmentation is particularly important when only a few demonstrations are available. For 20 demonstrations, the policy trained with no augmentation performs at 1% while the policy trained with standard and viewpoint augmentations together performs at 75%. The standard data augmentation brings more improvement than the viewpoint augmentation for small numbers of demonstrations (20 and 50). However, it does not reach 100% with 100 demonstrations, in contrast to multiple viewpoints. The policy trained with a combination of both augmentation types performs the best and achieves 93% and 100% success rate for 50 and 100 demonstrations respectively. In summary, data augmentation allows a significant reduction in the number of expert trajectories required to solve the task.
IV-D Evaluation of instance generalization
We evaluate how BC skills generalize from specific instances to object categories on the UR5-Pour task shown in Figure 2 (left). In the following experiments we use ResNet-18 trained with 200 synthetic demonstrations and the combination of standard and viewpoint data augmentations. The evaluation in our simulation environment compares the use of 1, 10 or 50 bottle instances during training. We use 50 different bottles to test the learned policies. All bottles are obtained from the ShapeNet dataset.
When trained with a single bottle, the policy achieves a success rate of 46% and does not generalize well to the test bottles. When trained on 10 bottles, the success rate is 84%. Further increasing the training set to 50 bottles does not improve the results. This can be explained by the fact that the additional bottles do not add information, as they are similar to the initial set.
IV-E Real robot experiments
We evaluate trained policies on a real robot, which has the same robotic arm and gripper as in simulation. We record depth images with a Microsoft Kinect 2 placed as in simulation. In order to apply the method on the real robot, we use a state-of-the-art sim2real technique based on data augmentation with domain randomization. This method uses a proxy task of cube position prediction and a set of basic image transformations to learn a sim2real data augmentation function for depth images. We augment the depth frames from synthetic expert demonstrations with this method and then train skill policies. Once a skill policy is trained on these augmented simulation images, it is directly used on the real robot.
We evaluate our method on UR5-Pick and UR5-Pour described in Section IV-A, see Figure 2. We use 20 different initial configurations for the real-world evaluation. We show that our approach, when trained with sim2real data augmentation, transfers well to the real robot. The learned policy manages to pick up cubes of sizes 3.5, 4.7 and 8 cm correctly in 20 out of 20 trials. The UR5-Pour task is slightly more difficult and 17 out of 20 trials are successful. For evaluation, we use 3 real bottles whose shapes were not present in the simulation training set.
IV-F Comparison with state-of-the-art methods
One of the few test-beds for robot manipulation is FetchPickPlace from OpenAI Gym, implemented in mujoco, see Figure 3(a). The goal for the agent is to pick up a cube and to move it to the red target. The agent observes the three last RGB-D images from a camera placed in front of the robot. The positions of the cube and the target are set at random for each trial. The reward of the task is a single sparse reward given if the cube is within a small ball around the target (depicted in red in Figure 3(a)). The maximum length of the task is set to 50 time-steps.
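The sparse success criterion can be sketched as a distance test. The reward value of 1.0, the radius used in the usage lines and the function name are assumptions made for this illustration; the text only states that the reward is given when the cube lies within a ball around the target.

```python
import math

def sparse_reward(cube_position, target_position, radius):
    """Return 1.0 if the cube lies within `radius` of the target, else 0.0."""
    return 1.0 if math.dist(cube_position, target_position) <= radius else 0.0

# Illustrative positions and radius in metres (not taken from the text).
near = sparse_reward((0.0, 0.0, 0.0), (0.03, 0.0, 0.0), radius=0.05)
far = sparse_reward((0.0, 0.0, 0.0), (0.10, 0.0, 0.0), radius=0.05)
```

Because the signal is zero everywhere outside the ball, an agent that explores at random rarely observes it, which is the difficulty the comparison in this section is concerned with.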
For a fair comparison with prior work, we follow the same procedure for the dataset recording and do not use any data augmentation. Instead, we record each synthetic demonstration from a viewpoint picked at random using the same sampling space described in Section IV-C. We train ResNet-18 policies on training sets with varying numbers of expert demonstrations. The results are reported in Figure 3(b) where we test our approach on 200 different initial configurations. As in prior work, we plot the success rate of both RL and IL methods with respect to the number of episodes used (either trials or synthetic demonstrations).
The success rates indicate that we outperform policies trained with the imitation learning method DAgger in terms of performance, and RL methods such as HER and DDPG in terms of data-efficiency. According to the reported results, DAgger does not reach 100% even with the largest number of demonstrations considered and, unlike our method, requires an expert during training. HER reaches a success rate of 100% but requires considerably more episodes than the number of demonstrations with which our approach achieves a 96% success rate.
V Evaluation of HRL-BC
This section evaluates the proposed HRL-BC approach of learning an RL master policy on top of pretrained BC skills. We evaluate the method on two tasks described in Section V-A: UR5-Bowl and UR5-Breakfast, illustrated in Figure 4. Both tasks are long and are trained with a single sparse reward of success. Section V-B compares our HRL-BC approach with a strong "BC-ordered" baseline where the skills are executed in a fixed and pre-defined order. We report results for simulated and real versions of both tasks. Note that our real robot experiments are performed with skills and master policies that have been trained exclusively in simulation.
V-A Experimental setup
HRL-BC tasks. To evaluate HRL-BC we consider two UR5 tasks. The goal of the UR5-Bowl task is to grasp the cube and to place it in the bowl as shown in Figure 4(a). The maximum episode length is 600. The UR5-Breakfast task contains a cup, a bottle and a bowl as shown in Figure 4(b). The agent needs to pour ingredients from the cup and the bottle into the bowl; the reward is positive if and only if all ingredients are in the bowl. The maximum episode length is 2000.
Skills and datasets. For each task we consider a set of skills defined by expert scripts. For the UR5-Bowl task, we define four skills: (a) go to the cube, (b) go down and grasp, (c) go up, and (d) go to the bowl and open the gripper. To train BC skill policies, we collect a dataset of 250 synthetic demonstrated trajectories for each skill recorded from 5 viewpoints.
For the UR5-Breakfast task, we define four skills: (a) go to the bottle, (b) go to the cup, (c) grasp an object and pour it to the bowl, and (d) release the held object. We collect a dataset with 250 demonstrations containing cup pouring, bottle pouring and object placement in random locations. Synthetic demonstrations are rendered in 5 viewpoints and contain different bottles and cups from ShapeNet. We emphasize that the expert dataset does not contain full task demonstrations and that all our training is done in simulation.
Training. Similar to Section IV, we learn BC skills to mimic synthetic demonstrations generated by the expert script. Unlike BC skill learning in Section IV, we here train all skills of the same task simultaneously using the multi-head CNN architecture illustrated in Figure 1(right). We represent each head by a convolutional block from ResNet-18. The output of each head is spatially averaged with average pooling and used to predict the skill action with a linear layer. When training the RL master, we execute selected skills for 60 consecutive time-steps for the UR5-Bowl task and 220 time-steps for the UR5-Breakfast task. The master policy keeps switching the skills until either the maximum length of the episode or the success condition is reached. We train HRL-BC using 8 different random seeds in parallel. The rest of the RL hyperparameters are described in Section III-C. During the RL and BC skill training we apply sim2real data augmentation to enable successful execution of policies on a real robot.
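To make the coarse timescale concrete, the short sketch below computes an upper bound on the number of skill selections per episode from the episode lengths given in Section V-A and the skill durations stated above. The helper name is hypothetical.

```python
def max_master_decisions(max_episode_length, skill_duration):
    """Upper bound on master decisions per episode (ceiling division)."""
    return -(-max_episode_length // skill_duration)

bowl_decisions = max_master_decisions(600, 60)         # UR5-Bowl
breakfast_decisions = max_master_decisions(2000, 220)  # UR5-Breakfast
```

In both tasks the master therefore makes at most 10 decisions per episode, which is what keeps exploration with a single sparse reward tractable.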
We evaluate HRL-BC and compare it to a strong baseline with a fixed sequence of BC skills following the manually pre-defined correct order (BC-ordered). Our results for the simulated "UR5-Bowl (sim)" and real "UR5-Bowl (real)" environments are reported in Table III. When tested in simulation, both HRL-BC and BC-ordered put the cube into the bowl with a success rate of 96%. HRL-BC manages to learn the sequence of skills given only sparse rewards.
We have also attempted to solve the UR5-Bowl task without skills by learning an RL policy performing low-level control. We initialized ResNet-18 with ImageNet-pretrained weights, froze it and trained the same architecture used for the HRL-BC master policy (3 convolutional layers with 64 filters each) on top of the visual representation with PPO. Whereas PPO never solved the task during training, HRL-BC reaches 96% after 400 episodes.
We evaluate UR5-Bowl on the real robot as illustrated in Figure 4(a)(right). While the ordered skills solve the task with a success rate of 17/20, HRL-BC succeeds in 18 out of 20 episodes. While the ordered skills use a fixed number of time-steps to execute actions, the master policy can determine how close the cube is to the gripper and can recover from failures.
We evaluate HRL-BC on the challenging UR5-Breakfast task both in simulation and on the real robot. Results are reported in Table III. In particular, we define two setups with different distances between the initial positions of the objects: UR5-Breakfast-Simple and UR5-Breakfast-Hard. In the hard setup, the bottle and the cup are placed close to each other and therefore the order of grasping becomes very important. Attempts to grasp the wrong object in this setup result in failures due to collisions of the robot with the other object. While training HRL-BC, we use the same BC skills for both setups. When tested in simulation, HRL-BC performs similarly to BC-ordered on UR5-Breakfast-Simple (90% of successful trials) but outperforms the manual ordering on UR5-Breakfast-Hard (60% for HRL-BC vs. 42% for BC-ordered). While the ordered skills always grasp the predefined object, HRL-BC can choose which object to grasp first, which is especially important on UR5-Breakfast-Hard. In the real-world evaluation, both HRL-BC and the ordered skills succeed in 16 out of 20 episodes on the simple setup. However, on the hard setup the performance of the ordered skills drops to 45% due to collisions. HRL-BC succeeds in 60% of trials by choosing the appropriate object to avoid collisions. Some qualitative results for HRL-BC are shown in Figure 4. The appendix presents more qualitative results. In particular, we demonstrate successful execution of HRL-BC while the robot is facing the challenges of previously unseen objects, dynamic scene changes and occlusions.
This paper introduces an approach for hierarchical RL with skills learned from synthetic demonstrated trajectories. Our method is well adapted to composite problems and requires neither full-task demonstrations nor intermediate rewards. We show excellent results in simulation and on a real robot. Given our sample-efficient strategy for learning primitive skills, the proposed method should generalize to a variety of new tasks. Future work includes learning multiple tasks with shared skills, determining new skills automatically and addressing contact-rich tasks.
-  D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in NIPS, 1989.
-  S. Ross and J. A. Bagnell, “Reinforcement and imitation learning via interactive no-regret learning,” in arXiv, 2014. [Online]. Available: http://arxiv.org/abs/1406.5979
-  L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric Actor Critic for Image-Based Robot Learning,” RSS, 2018.
-  A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in ICML, 2000.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, 2015. [Online]. Available: http://www.nature.com/doifinder/10.1038/nature14236
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, 2016.
-  A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Neural modular control for embodied question answering,” in CoRL, 2018.
-  H. M. Le, N. Jiang, A. Agarwal, M. Dudík, Y. Yue, and H. D. III, “Hierarchical imitation and reinforcement learning,” in ICML, 2018.
-  M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multi-goal reinforcement learning: Challenging robotics environments and request for research,” arXiv, 2018. [Online]. Available: http://arxiv.org/abs/1802.09464
-  A. Pashevich, R. A. M. Strudel, I. Kalevatykh, I. Laptev, and C. Schmid, “Learning to augment synthetic images for sim2real policy transfer,” IROS, 2019.
-  T. Lampe and M. Riedmiller, “Acquiring visual servoing reaching and grasping skills using neural reinforcement learning,” IJCNN, 2013.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation,” in ICML, 2016. [Online]. Available: http://arxiv.org/abs/1610.00633
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” JMLR, 2015. [Online]. Available: http://arxiv.org/abs/1504.00702
-  I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, “Data-efficient Deep Reinforcement Learning for Dexterous Manipulation,” arXiv, 2017. [Online]. Available: http://arxiv.org/abs/1704.03073
-  Y. Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba, “One-shot imitation learning,” NIPS, 2017.
-  M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing - Solving sparse reward tasks from scratch,” MLR, 2018. [Online]. Available: http://arxiv.org/abs/1802.10567
-  J. Ho and S. Ermon, “Generative adversarial imitation learning,” in NIPS, 2016.
-  V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous manipulation policies from experience and imitation,” arXiv, 2016. [Online]. Available: http://arxiv.org/abs/1611.05095
-  M. Laskey, J. Lee, R. Fox, A. D. Dragan, and K. Goldberg, “DART: Noise injection for robust imitation learning,” in CoRL, 2017. [Online]. Available: http://proceedings.mlr.press/v78/laskey17a.html
-  J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” IJRR, vol. 32, no. 11, 2013.
-  R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
-  P. Dayan and G. Hinton, “Feudal Reinforcement Learning,” in NIPS, 1993.
-  K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman, “Meta learning shared hierarchies,” ICLR, 2018. [Online]. Available: http://arxiv.org/abs/1710.09767
-  Y. Lee, S.-H. Sun, S. Somasundaram, E. S. Hu, and J. J. Lim, “Composing complex skills by learning transition policies with proximity reward induction,” in ICLR, 2019.
-  P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in AAAI, 2017.
-  C. Florensa, Y. Duan, and P. Abbeel, “Stochastic neural networks for hierarchical reinforcement learning,” ICLR, 2017. [Online]. Available: http://arxiv.org/abs/1704.03012
-  T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine, “Latent space policies for hierarchical reinforcement learning,” ICML, 2018. [Online]. Available: http://arxiv.org/abs/1804.02808
-  O. Nachum, S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” NIPS, 2018. [Online]. Available: http://arxiv.org/abs/1805.08296
-  A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” in ICML, 2017. [Online]. Available: http://arxiv.org/abs/1703.01161
-  T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in NIPS, 2016.
-  K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller, “Learning an embedding space for transferable robot skills,” in ICLR, 2018. [Online]. Available: https://openreview.net/forum?id=rk07ZXZRb
-  Y. Gao, H. Xu, J. Lin, F. Yu, S. Levine, and T. Darrell, “Reinforcement learning from imperfect demonstrations,” ICLR workshop, 2018. [Online]. Available: https://openreview.net/forum?id=BJJ9bz-0-
-  C.-A. Cheng, X. Yan, N. Wagener, and B. Boots, “Fast policy learning through imitation and reinforcement,” UAI, 2018.
-  W. Sun, J. A. Bagnell, and B. Boots, “Truncated horizon policy search: Combining reinforcement learning & imitation learning,” in ICLR, 2018.
-  T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, “Deep Q-Learning from demonstrations,” AAAI, 2018.
-  A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in ICRA, 2018. [Online]. Available: https://doi.org/10.1109/ICRA.2018.8463162
-  Y. Zhu, Z. Wang, J. Merel, A. A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, and N. Heess, “Reinforcement and imitation learning for diverse visuomotor skills,” arXiv, 2018. [Online]. Available: http://arxiv.org/abs/1802.09564
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
-  I. Kostrikov, “Pytorch implementations of reinforcement learning algorithms,” 2018. [Online]. Available: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr
-  E. Coumans and Y. Bai, “PyBullet, Python module for physics simulation, robotics and machine learning,” 2016-2017. [Online]. Available: http://pybullet.org/
-  A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An information-rich 3D model repository,” arXiv, 2015. [Online]. Available: http://arxiv.org/abs/1512.03012
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” CVPR, 2016.
-  E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in IROS, 2012. [Online]. Available: http://ieeexplore.ieee.org/document/6386109/
-  M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in NIPS, 2017.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” ICLR, 2016. [Online]. Available: http://arxiv.org/abs/1509.02971
We present additional qualitative results for the HRL-BC approach on the real robot with policies that have been trained in simulation, as in the main paper. Section VII-A describes and illustrates examples of UR5-Bowl and UR5-Breakfast policies. We demonstrate the robot's behavior when it faces previously unseen objects, dynamic changes of object locations, and occlusions. In Section VII-B we also illustrate feature map activations of the network, which provide a better understanding of the learned policies.
VII-A Qualitative results
UR5-Bowl: Multiple objects.
We experiment with the HRL-BC policy trained in the UR5-Bowl environment. Once the robot succeeds in placing a cube in the bowl, we put another cube on the table and let the policy continue, see Figure 5. Although the UR5-Bowl policy has been trained to handle one cube only, it automatically generalizes to multiple cubes when run in a loop.
UR5-Bowl: Previously unseen objects.
We further test the HRL-BC UR5-Bowl policy in the presence of previously unseen objects. While the policy has been trained to manipulate cubes of different sizes (see Section 5.1), we observe that it is robust to other object shapes. As shown in Figure 6, the policy successfully grasps real objects such as apples, oranges, lemons, and toys and places them into a bowl. Notably, when the robot fails to grasp an object, it automatically recovers and completes the task. This behavior comes naturally from our HRL-BC master policy, which has learned to adapt the sequence of skills given current observations of the scene.
UR5-Breakfast: New object instances.
To enable generalization of learned policies to new object instances, our UR5-Breakfast environment contains cups, bottles and bowls of different shapes from ShapeNet (see Section 5.1). During testing we run the learned HRL-BC UR5-Breakfast policy on a real robot and experiment with instances of bottles and cups unseen during training. Figure 7 demonstrates successful executions of the HRL-BC UR5-Breakfast policy in scenes with significant variations in object shapes, for example, using a wine glass instead of a cup.
UR5-Breakfast: Dynamic changes of object location.
Our BC skills make decisions at every time-step and, hence, can instantly adapt to changing conditions of the scene. We verify this by varying object positions during grasping attempts of the HRL-BC UR5-Breakfast policy. Figure 8 illustrates the reactive behavior of the robot grasping a cup that is simultaneously being moved by a person. The cup is successfully grasped after multiple changes of its position.
Another example of instant re-planning by our HRL-BC policy is demonstrated in Figure 9. While the robot approaches an object, we temporarily occlude the object and disrupt the BC skill that is being executed. Given the hierarchical nature of our HRL-BC approach, the RL master is able to recover from this failure by starting another skill that leads to the completion of the task. More precisely, when the bottle gets occluded in the example of Figure 9, the robot changes its strategy and decides to grasp and pour from the cup. Once the occlusion is removed, the robot automatically resumes and completes the task by grasping and pouring from the bottle.
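The recovery behavior described above can be illustrated with a minimal sketch of a hierarchical control loop. This is a toy model under stated assumptions, not the implementation used in the paper: the function `run_hierarchical_policy`, the parameter `steps_per_skill`, the hand-written `toy_master`, and the dictionary-based scene are all hypothetical stand-ins for the learned RL master, the BC skills, and the image observations.

```python
from typing import Callable, Dict, List


def run_hierarchical_policy(
    master: Callable[[dict], str],
    skills: Dict[str, Callable[[dict], dict]],
    is_done: Callable[[dict], bool],
    obs: dict,
    steps_per_skill: int = 1,   # assumption: master re-selects after this many steps
    max_steps: int = 30,
) -> List[str]:
    """Alternate between master skill selection and low-level skill execution."""
    chosen: List[str] = []
    step = 0
    while step < max_steps and not is_done(obs):
        name = master(obs)            # master decides from the current observation
        chosen.append(name)
        for _ in range(steps_per_skill):
            obs = skills[name](obs)   # the skill acts at every time step
            step += 1
            if is_done(obs):
                break
    return chosen


def make_toy_scene(occluded_until: int) -> dict:
    """Hypothetical scene: the bottle is hidden while t < occluded_until."""
    return {"t": 0, "occluded_until": occluded_until,
            "cup_done": False, "bottle_done": False}


def bottle_visible(obs: dict) -> bool:
    return obs["t"] >= obs["occluded_until"]


def toy_master(obs: dict) -> str:
    """Hand-written stand-in for the learned master policy."""
    if not obs["bottle_done"] and bottle_visible(obs):
        return "pour_from_bottle"
    if not obs["cup_done"]:
        return "pour_from_cup"
    return "wait"


def tick(obs: dict, **changes) -> dict:
    """Advance time by one step and apply the given state changes."""
    new_obs = dict(obs)
    new_obs["t"] += 1
    new_obs.update(changes)
    return new_obs


TOY_SKILLS = {
    # Pouring from the bottle only succeeds while the bottle is visible.
    "pour_from_bottle": lambda o: tick(o, bottle_done=o["bottle_done"] or bottle_visible(o)),
    "pour_from_cup": lambda o: tick(o, cup_done=True),
    "wait": lambda o: tick(o),
}


def toy_done(obs: dict) -> bool:
    return obs["cup_done"] and obs["bottle_done"]
```

In this toy setting, calling `run_hierarchical_policy(toy_master, TOY_SKILLS, toy_done, make_toy_scene(occluded_until=2))` makes the master switch to the cup while the bottle is hidden and return to the bottle once it becomes visible again, which mirrors the qualitative behavior in Figure 9.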
UR5-Breakfast: Failure case for BC-ordered.
Finally, we demonstrate the advantage of HRL-BC over the "BC-ordered" baseline, which executes skills in a fixed order. For the UR5-Breakfast environment we define the order of skills for the BC-ordered policy as 1. "go to the cup", 2. "grasp the object and pour it to the bowl", 3. "release the object", 4. "go to the bottle", 5. "grasp the object and pour it to the bowl", 6. "release the object". While this policy succeeds in many cases, placing objects close together causes problems. Figure 10 shows an example scene in which a cup and a bottle are close to each other, with the cup placed in front of the bottle. As the BC-ordered policy is pre-programmed to grasp the cup first, its execution results in a collision between the gripper and the bottle, followed by the failure of the task. In contrast, our HRL-BC policy learns to select the order in which objects are grasped so as to avoid such failures. Hence, HRL-BC automatically learns to avoid collisions without the need for specific intermediate rewards. This advantage of HRL-BC is confirmed by the quantitative results in Table 3.
VII-B Feature map activations
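For concreteness, the fixed-order baseline can be sketched as a loop over a predefined skill list that aborts at the first failing skill. The skill order is taken from the text; `run_fixed_order` and `make_toy_executor` are hypothetical names, and the executor simply marks chosen steps as failing rather than simulating a real collision.

```python
from typing import Callable, List, Set, Tuple

# Skill order quoted from the text for the BC-ordered baseline.
BC_ORDERED: List[str] = [
    "go to the cup",
    "grasp the object and pour it to the bowl",
    "release the object",
    "go to the bottle",
    "grasp the object and pour it to the bowl",
    "release the object",
]


def run_fixed_order(
    order: List[str], execute: Callable[[int, str], bool]
) -> Tuple[bool, int]:
    """Execute skills in a fixed order and stop at the first failure.

    Returns (task_succeeded, number_of_skills_completed).
    """
    for index, skill in enumerate(order):
        if not execute(index, skill):
            return False, index
    return True, len(order)


def make_toy_executor(failing_steps: Set[int]) -> Callable[[int, str], bool]:
    """Hypothetical executor: a skill fails if its step index is in failing_steps."""
    return lambda index, skill: index not in failing_steps
```

Because the order never depends on the observation, a single failing step (for example the first grasp) ends the episode, whereas a master policy that selects skills from observations could choose a different order.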
The HRL-BC policy uses no explicit representation of scenes, for example in terms of categories and locations of objects. Some interpretation of learned policies, however, can be obtained by examining spatial activations at intermediate layers of the neural network. Figure 11 shows saliency maps of an HRL-BC policy, highlighting which parts of the image the agent concentrates on. The saliency maps are computed as activations of convolutional feature maps obtained from the last layers of the two networks (see Figure 1), averaged over all channels and layers. The resulting heatmaps are shown for different stages of the UR5-Breakfast task. Interestingly, while grasping and moving the bottle, the network generates the highest activations around the bottle and the gripper, while ignoring other objects. When releasing the bottle, however, high activations are also observed around the cup. The attention to the cup might be explained by the need to avoid collisions when placing the bottle on the table. Note that we provide no intermediate rewards to RL. Nevertheless, RL learns to avoid collisions, since collisions imply failure of the final task and, hence, no final positive reward during training. Once the bottle is placed on the table, activations become low for the bottle, while the heatmap attains its maximum values at the manipulated cup. Observing such feature maps has been useful in our work for identifying certain failure cases. We believe feature map activations are a useful tool for interpreting learned policies.
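The heatmap computation described above (averaging convolutional activations over channels and layers) can be sketched in a few lines. This is an illustrative sketch only: the text does not specify how layers of different resolution are combined or how the result is scaled, so the nearest-neighbor resizing in `resize_nearest` and the min-max normalization are assumptions, and all function names are hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def mean_over_channels(channels: List[Map2D]) -> Map2D:
    """Average a stack of same-sized feature maps over the channel axis."""
    height, width = len(channels[0]), len(channels[0][0])
    count = len(channels)
    return [[sum(ch[y][x] for ch in channels) / count for x in range(width)]
            for y in range(height)]


def resize_nearest(fmap: Map2D, out_h: int, out_w: int) -> Map2D:
    """Nearest-neighbor resize (assumed way to bring layers to a common size)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def saliency_map(layers: List[List[Map2D]], out_h: int, out_w: int) -> Map2D:
    """Average activations over channels, then over layers, and scale to [0, 1]."""
    per_layer = [resize_nearest(mean_over_channels(channels), out_h, out_w)
                 for channels in layers]
    count = len(per_layer)
    averaged = [[sum(layer[y][x] for layer in per_layer) / count
                 for x in range(out_w)] for y in range(out_h)]
    low = min(min(row) for row in averaged)
    high = max(max(row) for row in averaged)
    span = (high - low) or 1.0   # avoid division by zero for constant maps
    return [[(value - low) / span for value in row] for row in averaged]
```

Here `layers` holds, for each selected layer, a list of per-channel 2D activation maps; the returned map can be overlaid on the input image as a heatmap.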
[Figure panels: pouring from bottle | pouring from cup]